Lab 3: CRISP-DM Capstone

Association Rule Mining, Clustering, or Collaborative Filtering

Ryan Bass, Brett Benefield, Cho Kim, Nicole Wittlin

Contents

In [38]:
# Display plots below cells
%matplotlib notebook

# Turn off annoying warnings
import warnings
warnings.filterwarnings("ignore")
In [39]:
import pprint
import pandas as pd
import numpy as np
import yellowbrick as yb
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from pprint import pprint
from time import time
from datetime import datetime
from sklearn import metrics as mt
from sklearn import neighbors
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.feature_selection import VarianceThreshold, SelectFromModel, SelectPercentile, f_regression, mutual_info_regression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score, f1_score, roc_auc_score, mean_absolute_error, make_scorer, mean_squared_error, silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, Binarizer, scale
from sklearn.svm import LinearSVC, NuSVC, SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, kneighbors_graph
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier, RidgeClassifier, LinearRegression
from sklearn.linear_model import Ridge, Lasso, LassoCV
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, ShuffleSplit, StratifiedShuffleSplit, StratifiedKFold, GridSearchCV, cross_validate, RandomizedSearchCV, cross_val_score
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering, DBSCAN, AgglomerativeClustering
from sklearn import cluster, mixture
from sklearn.neighbors import kneighbors_graph
from yellowbrick.classifier import ClassificationReport, ConfusionMatrix, ClassPredictionError, ROCAUC
from yellowbrick.features import Rank1D, Rank2D, RFECV
from yellowbrick.features.importances import FeatureImportances
from yellowbrick.model_selection import ValidationCurve, LearningCurve
from yellowbrick.regressor import PredictionError, ResidualsPlot
from yellowbrick.regressor.alphas import AlphaSelection
from yellowbrick.cluster import SilhouetteVisualizer, KElbowVisualizer
In [40]:
# Show all columns/rows in output
pd.set_option('max_rows', None)
pd.set_option('max_columns', None)
In [41]:
def scoreRF(X, y, cLabels):
    acc = cross_val_score(cls, X, y = y, cv = cv, scoring = 'accuracy', n_jobs=-1)
    if len(np.unique(cLabels)) > 1:
        sil = silhouette_score(dfDropped[featCols], cLabels)
    else:
        sil = -2
    result = {'mean': acc.mean(), 'std': acc.std(), 'sil': sil}
    
    return result

def scoreExtremeRF(X, y, cLabels):
    acc = cross_val_score(cls, X, y = y, cv = cv, scoring = 'accuracy', n_jobs=-1)
    if len(np.unique(cLabels)) > 1:
        sil = silhouette_score(dfExtremes[featCols], cLabels)
    else:
        sil = -2
    result = {'mean': acc.mean(), 'std': acc.std(), 'sil': sil}
    
    return result

Supporting Functions

In [42]:
# Brett's directory
# Laptop
%cd "C:\sandbox\SMU\dataMining\choNotebook\EducationDataNC\2017\Machine Learning Datasets"

# Ryan's directory
#%cd "C:\Users\Clovis\Documents\7331DataMining\EducationDataNC\2017\Machine Learning Datasets"

# Cho's directory
# cd "/Users/chostone/Documents/Data Mining/7331DataMining/EducationDataNC/2017/Machine Learning Datasets"

# NW directory
#%cd "C:\Users\Nicole Wittlin\Documents\Classes\MSDS7331\Project\2017\Machine Learning Datasets"

dfPublicHS = pd.read_csv("PublicHighSchools2017_ML.csv")
C:\sandbox\SMU\dataMining\choNotebook\EducationDataNC\2017\Machine Learning Datasets

Back to Top

Business Understanding

Data Collection Overview

The team selected data from the Belk Endowment Educational Attainment Data for North Carolina Public Schools, which contains the North Carolina Public Schools Report Card as well as the Statistical Profiles Databases. This data was compiled by Dr. Jake Drew from original sources provided by the Public Schools of North Carolina (http://ncpublicschools.org), and the compilation, research, and analysis of the educational attainment data was funded by the John M. Belk Endowment (JMBE).

JMBE’s mission is focused on postsecondary education in North Carolina to help underrepresented students access and complete postsecondary education and be better prepared for entering the workforce. The educational attainment data set contains comprehensive statistics, demographics, and achievement metrics about North Carolina public, charter, and alternative elementary, middle, and high schools. This wealth of data is the foundation for research to help JMBE understand trends and improve postsecondary pathways in the state.

Our team has selected the 2017 high school data and utilized the machine learning data set prepared by Dr. Drew for analysis. Throughout the semester, we have been exploring the relationship between enrollment in postsecondary education within 16 months of high school graduation and teacher metrics related to teacher education, licensing, and certification, in addition to years of experience. This is important to help identify both positive and negative factors influencing students’ enrollment decisions and understand how educators can impact the pipeline to higher education.

Using the insights gained throughout the semester, our team has discovered that the ACT Score of a school strongly influences the percentage of students that enroll in post-secondary education. Now, we want to understand what factors are impacting ACT scores at the school level, which is the focus of the analysis in this lab. Our goals are to identify school features that correlate with schools that perform well on the ACT on average and those schools that perform poorly on the ACT on average. This information will allow schools to focus on specific areas to ultimately improve their overall performance. Ultimately, this will allow NC public schools to address what has been deemed the "leaky pipeline."

Algorithm Effectiveness

The team is using a two-pronged approach to mine the ACT data, starting with cluster analysis and then employing Random Forest for classification. Depending on the model, there are different metrics that can be used to analyze the effectiveness and predictive capability, and the most common metrics are outlined below.

Clustering Algorithm

Silhouette Analysis

Silhouette analysis is a method used to understand the validity and interpretation of clustering by balancing cohesion (how closely related objects are in a cluster) and separation (how distinct or well-separated a cluster is from other clusters). It measures the separation distance between resulting clusters, or how close each point in one cluster is to points in neighboring clusters. The metric, known as silhouette coefficients, range from -1 to 1. A coefficient of 0 indicates that the sample is either very close or completely aligned with the decision boundary between two neighboring clusters; a negative value indicates that samples may have been assigned to incorrect clusters. A positive value is ideal, with 1 being the optimal value.

Citations:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
https://en.wikipedia.org/wiki/Silhouette_(clustering)
• Tan, Pang-Ning, Steinbach, Michael, and Kumar, Vipin (2006), Introduction To Data Mining (1st ed.), Boston, MA: Person Education.

Classification Algorithms

Confusion Matrix

The confusion matrix is a table that summarizes the performance of a classification model based on the count of test records that are correctly and incorrectly predicted by the model. This summary becomes the basis to calculate additional metrics such as accuracy, precision, recall, and the F-Score.

CONFUSION MATRIX Predicted Yes Predicted No
Actual Yes True Positive (TP) False Negative (FN)
Actual No False Positive (FP) True Negative (TN)

Accuracy

Accuracy is a typical measurement used to evaluate classification models and is consider to be a good measure despite a few limitations. It is calculated by dividing the number of correct predictions by the total number of predictions, or TP + TN / TP + TN + FP + FN. The error rate is essentially the opposite of accuracy and looks at incorrect predictions. Most classification algorithms seek models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set.

One obvious limitation of accuracy is that it ignores the cost of misclassification, and this is particularly evident when the algorithm is trying to predict on imbalanced data sets. This can be addressed by looking at other metrics as part of the evaluation process.

Precision

Precision is a widely used classification metric where the successful detection of one class is more important than another. It is a cost-sensitive measure for the fraction of actual positive records in the group that the classifier declared in the positive class. The calculation for precision is TP / TP + FP. Higher precision means lower false positives, and the metric tends to be biased toward true positives. A good model will maximize precision.

Recall

Similar to precision, recall is also a widely used cost-sensitive classification metric. It measures the fraction of positive examples correctly predicted by the model, where large recall measures have few positive examples misclassified as negative, or lower false negatives. The calculation is TP / TP + FN, and the measure penalizes the model when it yields a negative when the true result is positive. Again, a good model will maximize recall.

F-Score

Precision and recall can be summarized into the F-score metric (also known as the F-1 measure or F-measure). Generally, this score is a weighted accuracy measure that takes into account both precision and recall. It is calculated as the harmonic mean between the two: 2 x TP / 2 x TP + FP + FN. This too is a metric to maximize; a high value F-score indicates that precision and recall are reasonably high.

Citation:
• Tan, Pang-Ning, Steinbach, Michael, and Kumar, Vipin (2006), Introduction To Data Mining (1st ed.), Boston, MA: Person Education.

Cross Validation

In this lab, we used cross validation (CV) as a way to measure the performance of our models. We used 10 folds to limit the probability of selecting data that is biased. We also defined our test/training as a random shuffle split of 20% / 80%. By using CV and shuffle splitting, it is more likely that our models will perform similarly when applied to new data sets. We want our model to be able to accurately identify schools that are in the top quartile and the bottom quartile of ACT performance, on both training data and test data. If schools were incorrectly identified, then time, money, and/or resources would be wasted on schools that are performing adequately. This allows the greatest impact to be made on low performing schools, which is why it was crucial to include cross validation in our training and test sets.

Citation:
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f

Back to Top

Data Understanding

Data Understanding 1

The data for our analysis is from the Belk Endowment Educational Attainment Data Repository for North Carolina Public Schools. The data contains information about "public, charter, and alternative schools in the State of North Carolina." We are using Dr. Jake Drew's NC Education Data available on GitHub: https://github.com/jakemdrew/EducationDataNC.

The Machine Learning Datasets created by Dr. Drew have already been preprocessed for machine learning and have gone through the following processes (reference: http://nbviewer.jupyter.org/github/jakemdrew/EducationDataNC/blob/master/2017/Machine%20Learning%20Datasets/Source%20Code/PublicHighSchools2017_ML.ipynb)

  1. Missing student body racial compositions are imputed using district averages.
  2. Columns that have the same value in every single row are deleted.
  3. Columns that have a unique value in every single row (all values are different) are deleted.
  4. Empty columns (all values are NA or NULL) are deleted.
  5. Numeric columns with more than the percentage of missing values specified by the missingThreshold parameter.
  6. Remaining numeric, non-race columns with missing values are imputed / populated with 0. In many cases, schools are not reporting values when they are zero. However, mean imputation or some other more sophisticated strategy might be considered here.
  7. Categorical / text based columns with > uniqueThreshold unique values are deleted.
  8. All remaining categorical / text based columns are one-hot encoded. In categorical columns, one-hot encoding creates one new boolean / binary field per unique value in the target column, converting all categorical columns to a numeric data type.
  9. Duplicated or highly similar columns with > 95% correlation are delelted.

Above list is from the PublicHighSchools2017_ML.ipynb Notebook

For our analysis, we used the PublicHighSchools2017_ML.csv file. This file only contains information regarding public high schools in North Carolina for the 2016-2017 school year.

Data Preprocessing

We replaced all non-alphanumeric characters with underscores.

In [43]:
# Replace all non alphanumeric characters with underscores
dfPublicHS.columns = dfPublicHS.columns.str.replace(r'\W', "_")

We changed all data types as floats since some libraries only work with floats. We are also treating the unit_code column as a string.

In [44]:
# Change all columns to floats since some libraries only work with floats
dfPublicHS = dfPublicHS.astype(float)

# Treat unit_code as a string
dfPublicHS["unit_code"] = dfPublicHS.astype({"unit_code": str})

We decided to drop any remaining variables related to the ACT such as ACT benchmarks and individual subject scores to not bias our model. Doing this also reduces multicollinearity.

In [45]:
#want to delete any remaining variables related to the ACT score (such as ACT benchmarks) to not bias our model
dfDropped = dfPublicHS

temp = dfDropped['ACT_Score']

dropCols = dfDropped.filter(regex = r'ACT')

dfDropped.drop(dropCols, axis = 1, inplace = True)

dfDropped['ACT_Score'] = temp

This is a list of columns that were deleted. The ACT score was put back into the dfDropped data frame.

In [46]:
#list of all the columns that were deleted (note that ACT Score was put back into dataframe that is being used)
dropCols.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 470 entries, 0 to 469
Data columns (total 87 columns):
ACT_Score                                   470 non-null float64
ACT_WorkKeys_Score                          470 non-null float64
ACTMath_ACTBenchmark_All                    470 non-null float64
ACTScience_ACTBenchmark_All                 470 non-null float64
ACTWorkKeys_SilverPlus_All                  470 non-null float64
ACTWriting_ACTBenchmark_All                 470 non-null float64
ACTCompositeScore_UNCMin_Female             470 non-null float64
ACTEnglish_ACTBenchmark_Female              470 non-null float64
ACTMath_ACTBenchmark_Female                 470 non-null float64
ACTReading_ACTBenchmark_Female              470 non-null float64
ACTScience_ACTBenchmark_Female              470 non-null float64
ACTWorkKeys_SilverPlus_Female               470 non-null float64
ACTCompositeScore_UNCMin_Male               470 non-null float64
ACTMath_ACTBenchmark_Male                   470 non-null float64
ACTScience_ACTBenchmark_Male                470 non-null float64
ACTWorkKeys_SilverPlus_Male                 470 non-null float64
ACTWriting_ACTBenchmark_Male                470 non-null float64
ACTCompositeScore_UNCMin_AmericanIndian     470 non-null float64
ACTMath_ACTBenchmark_AmericanIndian         470 non-null float64
ACTSubtests_BenchmarksMet_AmericanIndian    470 non-null float64
ACTWorkKeys_SilverPlus_AmericanIndian       470 non-null float64
ACTWriting_ACTBenchmark_AmericanIndian      470 non-null float64
ACTCompositeScore_UNCMin_Asian              470 non-null float64
ACTSubtests_BenchmarksMet_Asian             470 non-null float64
ACTWorkKeys_SilverPlus_Asian                470 non-null float64
ACTWriting_ACTBenchmark_Asian               470 non-null float64
ACTCompositeScore_UNCMin_Black              470 non-null float64
ACTEnglish_ACTBenchmark_Black               470 non-null float64
ACTMath_ACTBenchmark_Black                  470 non-null float64
ACTReading_ACTBenchmark_Black               470 non-null float64
ACTScience_ACTBenchmark_Black               470 non-null float64
ACTSubtests_BenchmarksMet_Black             470 non-null float64
ACTWorkKeys_SilverPlus_Black                470 non-null float64
ACTWriting_ACTBenchmark_Black               470 non-null float64
ACTCompositeScore_UNCMin_Hispanic           470 non-null float64
ACTEnglish_ACTBenchmark_Hispanic            470 non-null float64
ACTMath_ACTBenchmark_Hispanic               470 non-null float64
ACTReading_ACTBenchmark_Hispanic            470 non-null float64
ACTScience_ACTBenchmark_Hispanic            470 non-null float64
ACTSubtests_BenchmarksMet_Hispanic          470 non-null float64
ACTWorkKeys_SilverPlus_Hispanic             470 non-null float64
ACTWriting_ACTBenchmark_Hispanic            470 non-null float64
ACTCompositeScore_UNCMin_TwoorMoreRaces     470 non-null float64
ACTMath_ACTBenchmark_TwoorMoreRaces         470 non-null float64
ACTReading_ACTBenchmark_TwoorMoreRaces      470 non-null float64
ACTScience_ACTBenchmark_TwoorMoreRaces      470 non-null float64
ACTSubtests_BenchmarksMet_TwoorMoreRaces    470 non-null float64
ACTWorkKeys_SilverPlus_TwoorMoreRaces       470 non-null float64
ACTWriting_ACTBenchmark_TwoorMoreRaces      470 non-null float64
ACTCompositeScore_UNCMin_White              470 non-null float64
ACTMath_ACTBenchmark_White                  470 non-null float64
ACTScience_ACTBenchmark_White               470 non-null float64
ACTSubtests_BenchmarksMet_White             470 non-null float64
ACTWorkKeys_SilverPlus_White                470 non-null float64
ACTWriting_ACTBenchmark_White               470 non-null float64
ACTCompositeScore_UNCMin_EDS                470 non-null float64
ACTEnglish_ACTBenchmark_EDS                 470 non-null float64
ACTMath_ACTBenchmark_EDS                    470 non-null float64
ACTReading_ACTBenchmark_EDS                 470 non-null float64
ACTScience_ACTBenchmark_EDS                 470 non-null float64
ACTSubtests_BenchmarksMet_EDS               470 non-null float64
ACTWorkKeys_SilverPlus_EDS                  470 non-null float64
ACTWriting_ACTBenchmark_EDS                 470 non-null float64
ACTCompositeScore_UNCMin_LEP                470 non-null float64
ACTEnglish_ACTBenchmark_LEP                 470 non-null float64
ACTMath_ACTBenchmark_LEP                    470 non-null float64
ACTReading_ACTBenchmark_LEP                 470 non-null float64
ACTScience_ACTBenchmark_LEP                 470 non-null float64
ACTSubtests_BenchmarksMet_LEP               470 non-null float64
ACTWorkKeys_SilverPlus_LEP                  470 non-null float64
ACTWriting_ACTBenchmark_LEP                 470 non-null float64
ACTCompositeScore_UNCMin_SWD                470 non-null float64
ACTEnglish_ACTBenchmark_SWD                 470 non-null float64
ACTMath_ACTBenchmark_SWD                    470 non-null float64
ACTReading_ACTBenchmark_SWD                 470 non-null float64
ACTScience_ACTBenchmark_SWD                 470 non-null float64
ACTSubtests_BenchmarksMet_SWD               470 non-null float64
ACTWorkKeys_SilverPlus_SWD                  470 non-null float64
ACTWriting_ACTBenchmark_SWD                 470 non-null float64
ACTCompositeScore_UNCMin_AIG                470 non-null float64
ACTMath_ACTBenchmark_AIG                    470 non-null float64
ACTScience_ACTBenchmark_AIG                 470 non-null float64
ACTSubtests_BenchmarksMet_AIG               470 non-null float64
ACTWorkKeys_SilverPlus_AIG                  470 non-null float64
ACTWriting_ACTBenchmark_AIG                 470 non-null float64
ACT_pTarget_PctMet                          470 non-null float64
ACTWorkKeys_pTarget_PctMet                  470 non-null float64
dtypes: float64(87)
memory usage: 319.5 KB

Data Definitions and Summary Statistics

Here is a summary of the dataset, with minimum and maximum values, which gives an idea of the scale and values for each variable.

In [47]:
dfDropped.describe(include='all').transpose()
Out[47]:
count unique top freq mean std min 25% 50% 75% max
student_num 470.0 NaN NaN NaN 834.336170 593.357073 8.000000 312.750000 758.500000 1208.500000 2966.000000
lea_avg_student_num 470.0 NaN NaN NaN 823.078723 360.428092 105.000000 577.000000 810.000000 974.000000 1852.000000
st_avg_student_num 470.0 NaN NaN NaN 833.417021 97.416634 278.000000 853.000000 853.000000 853.000000 853.000000
09_Size 470.0 NaN NaN NaN 18.251064 8.414169 0.000000 16.000000 20.000000 23.000000 94.000000
10_Size 470.0 NaN NaN NaN 17.534043 9.269129 0.000000 15.000000 20.000000 23.000000 92.000000
11_Size 470.0 NaN NaN NaN 17.117021 8.807896 0.000000 14.000000 20.000000 24.000000 30.000000
12_Size 470.0 NaN NaN NaN 15.623404 9.262385 0.000000 11.000000 18.000000 22.000000 36.000000
Biology_Size 470.0 NaN NaN NaN 18.159574 5.636281 0.000000 15.000000 19.000000 22.000000 30.000000
English_II_Size 470.0 NaN NaN NaN 19.110638 5.644826 0.000000 17.000000 20.000000 23.000000 30.000000
Math_I_Size 470.0 NaN NaN NaN 17.991489 5.559518 0.000000 16.000000 18.000000 21.750000 32.000000
lea_total_expense_num 470.0 NaN NaN NaN 9451.491362 1217.044567 8150.840000 8662.317500 9148.790000 9766.770000 17718.540000
lea_salary_expense_pct 470.0 NaN NaN NaN 0.832023 0.025146 0.761000 0.816000 0.831000 0.852000 0.878000
lea_services_expense_pct 470.0 NaN NaN NaN 0.080360 0.016874 0.052000 0.069000 0.079000 0.087000 0.148000
lea_supplies_expense_pct 470.0 NaN NaN NaN 0.077851 0.014307 0.028000 0.069000 0.079000 0.091000 0.117000
lea_instruct_equip_exp_pct 470.0 NaN NaN NaN 0.009766 0.006818 0.001000 0.006000 0.009000 0.012000 0.045000
lea_federal_perpupil_num 470.0 NaN NaN NaN 1102.743894 328.272771 518.190000 909.817500 1069.730000 1221.270000 2670.310000
lea_local_perpupil_num 470.0 NaN NaN NaN 2098.410660 732.479671 848.850000 1655.070000 1933.510000 2412.730000 6150.800000
lea_state_perpupil_num 470.0 NaN NaN NaN 6250.336809 915.740784 5342.870000 5685.100000 5953.920000 6494.750000 12794.690000
SPG_Score 470.0 NaN NaN NaN 72.729787 13.171874 9.000000 64.000000 72.000000 82.000000 100.000000
EVAAS_Growth_Score 470.0 NaN NaN NaN 75.405957 18.126891 0.000000 64.475000 79.550000 87.575000 100.000000
NC_Math_1_Score 470.0 NaN NaN NaN 57.842553 22.915174 0.000000 43.000000 56.000000 75.000000 100.000000
English_II_Score 470.0 NaN NaN NaN 61.068085 21.922709 0.000000 48.000000 60.000000 75.000000 100.000000
Biology_Score 470.0 NaN NaN NaN 56.582979 22.622246 0.000000 42.000000 57.500000 72.000000 100.000000
Passing_NC_Math_3 470.0 NaN NaN NaN 95.942553 17.088722 0.000000 100.000000 100.000000 100.000000 100.000000
4_Year_Cohort_Graduation_Rate_Score 470.0 NaN NaN NaN 87.931915 17.511727 0.000000 85.000000 91.000000 100.000000 100.000000
EOCSubjects_CACR_All 470.0 NaN NaN NaN 48.403617 20.928241 0.000000 34.000000 45.600000 62.175000 100.000000
GraduationRate_5yr_All 470.0 NaN NaN NaN 86.147660 22.002872 0.000000 85.400000 90.300000 100.000000 100.000000
EOCBiology_CACR_Female 470.0 NaN NaN NaN 46.260426 22.963376 0.000000 30.575000 47.050000 62.350000 100.000000
EOCEnglish2_CACR_Female 470.0 NaN NaN NaN 53.501489 23.159822 0.000000 39.200000 53.300000 70.650000 100.000000
GraduationRate_4yr_Female 470.0 NaN NaN NaN 88.285745 21.452757 0.000000 87.625000 92.750000 100.000000 100.000000
GraduationRate_5yr_Female 470.0 NaN NaN NaN 86.274468 25.283799 0.000000 87.225000 92.350000 100.000000 100.000000
EOCBiology_CACR_Male 470.0 NaN NaN NaN 45.360426 25.095552 0.000000 28.525000 44.500000 62.600000 100.000000
EOCEnglish2_CACR_Male 470.0 NaN NaN NaN 44.504894 23.826171 0.000000 29.625000 40.950000 57.575000 100.000000
EOCMathI_CACR_Male 470.0 NaN NaN NaN 43.722553 25.370064 0.000000 25.925000 39.700000 58.150000 100.000000
GraduationRate_4yr_Male 470.0 NaN NaN NaN 83.155745 22.982269 0.000000 81.425000 88.200000 94.100000 100.000000
GraduationRate_5yr_Male 470.0 NaN NaN NaN 80.832340 26.876277 0.000000 81.525000 87.500000 94.100000 100.000000
EOCBiology_CACR_AmericanIndian 470.0 NaN NaN NaN 1.238511 7.021255 0.000000 0.000000 0.000000 0.000000 76.500000
EOCEnglish2_CACR_AmericanIndian 470.0 NaN NaN NaN 1.187234 6.527381 0.000000 0.000000 0.000000 0.000000 72.700000
EOCMathI_CACR_AmericanIndian 470.0 NaN NaN NaN 1.437234 7.920304 0.000000 0.000000 0.000000 0.000000 62.500000
EOCSubjects_CACR_AmericanIndian 470.0 NaN NaN NaN 3.816809 12.956514 0.000000 0.000000 0.000000 0.000000 86.700000
GraduationRate_4yr_AmericanIndian 470.0 NaN NaN NaN 3.246170 16.504871 0.000000 0.000000 0.000000 0.000000 100.000000
GraduationRate_5yr_AmericanIndian 470.0 NaN NaN NaN 3.172766 16.059148 0.000000 0.000000 0.000000 0.000000 100.000000
EOCBiology_CACR_Asian 470.0 NaN NaN NaN 10.014255 25.728790 0.000000 0.000000 0.000000 0.000000 100.000000
EOCEnglish2_CACR_Asian 470.0 NaN NaN NaN 10.130426 25.372320 0.000000 0.000000 0.000000 0.000000 100.000000
EOCMathI_CACR_Asian 470.0 NaN NaN NaN 10.281702 26.718175 0.000000 0.000000 0.000000 0.000000 100.000000
EOCSubjects_CACR_Asian 470.0 NaN NaN NaN 23.438723 34.738576 0.000000 0.000000 0.000000 54.500000 100.000000
GraduationRate_4yr_Asian 470.0 NaN NaN NaN 13.363617 33.244258 0.000000 0.000000 0.000000 0.000000 100.000000
GraduationRate_5yr_Asian 470.0 NaN NaN NaN 12.607447 32.445970 0.000000 0.000000 0.000000 0.000000 100.000000
EOCBiology_CACR_Black 470.0 NaN NaN NaN 21.757234 22.148368 0.000000 0.000000 17.300000 33.225000 100.000000
EOCEnglish2_CACR_Black 470.0 NaN NaN NaN 25.151702 22.598941 0.000000 0.000000 22.550000 36.050000 100.000000
EOCMathI_CACR_Black 470.0 NaN NaN NaN 21.730426 22.015915 0.000000 0.000000 18.200000 31.125000 100.000000
EOCSubjects_CACR_Black 470.0 NaN NaN NaN 27.654255 22.143018 0.000000 13.400000 23.550000 37.050000 100.000000
GraduationRate_4yr_Black 470.0 NaN NaN NaN 63.867660 42.184616 0.000000 0.000000 86.200000 94.400000 100.000000
GraduationRate_5yr_Black 470.0 NaN NaN NaN 62.505957 42.541215 0.000000 0.000000 85.800000 93.800000 100.000000
EOCBiology_CACR_Hispanic 470.0 NaN NaN NaN 27.338723 24.712108 0.000000 0.000000 25.000000 42.900000 100.000000
EOCEnglish2_CACR_Hispanic 470.0 NaN NaN NaN 30.145532 24.740157 0.000000 0.000000 30.000000 44.550000 100.000000
EOCMathI_CACR_Hispanic 470.0 NaN NaN NaN 29.359574 24.705322 0.000000 0.000000 29.050000 42.825000 100.000000
EOCSubjects_CACR_Hispanic 470.0 NaN NaN NaN 37.599149 22.730571 0.000000 24.700000 35.750000 48.950000 100.000000
GraduationRate_4yr_Hispanic 470.0 NaN NaN NaN 55.881702 41.982791 0.000000 0.000000 79.000000 89.950000 100.000000
GraduationRate_5yr_Hispanic 470.0 NaN NaN NaN 52.496383 43.158125 0.000000 0.000000 77.300000 89.675000 100.000000
EOCBiology_CACR_TwoorMoreRaces 470.0 NaN NaN NaN 14.547872 24.012428 0.000000 0.000000 0.000000 29.400000 100.000000
EOCEnglish2_CACR_TwoorMoreRaces 470.0 NaN NaN NaN 15.751064 24.958308 0.000000 0.000000 0.000000 35.600000 100.000000
EOCMathI_CACR_TwoorMoreRaces 470.0 NaN NaN NaN 14.822766 22.620617 0.000000 0.000000 0.000000 30.225000 90.900000
EOCSubjects_CACR_TwoorMoreRaces 470.0 NaN NaN NaN 29.922553 26.701570 0.000000 0.000000 31.550000 50.000000 100.000000
GraduationRate_4yr_TwoorMoreRaces 470.0 NaN NaN NaN 26.548085 40.767276 0.000000 0.000000 0.000000 76.800000 100.000000
GraduationRate_5yr_TwoorMoreRaces 470.0 NaN NaN NaN 26.239787 40.868729 0.000000 0.000000 0.000000 78.600000 100.000000
EOCBiology_CACR_White 470.0 NaN NaN NaN 52.112340 25.118783 0.000000 40.525000 54.200000 68.950000 100.000000
EOCEnglish2_CACR_White 470.0 NaN NaN NaN 54.517872 26.588834 0.000000 43.100000 55.350000 73.025000 100.000000
EOCMathI_CACR_White 470.0 NaN NaN NaN 48.677021 25.591518 0.000000 34.650000 48.950000 66.375000 100.000000
EOCSubjects_CACR_White 470.0 NaN NaN NaN 55.477872 21.567189 0.000000 42.900000 54.800000 70.700000 100.000000
GraduationRate_4yr_White 470.0 NaN NaN NaN 81.973404 28.163214 0.000000 83.925000 90.200000 100.000000 100.000000
GraduationRate_5yr_White 470.0 NaN NaN NaN 80.649787 29.648325 0.000000 83.375000 90.050000 100.000000 100.000000
EOCBiology_CACR_EDS 470.0 NaN NaN NaN 34.594255 22.284915 0.000000 20.000000 32.000000 46.525000 100.000000
EOCEnglish2_CACR_EDS 470.0 NaN NaN NaN 37.509149 22.024282 0.000000 24.525000 32.900000 47.075000 100.000000
EOCMathI_CACR_EDS 470.0 NaN NaN NaN 34.671064 22.543040 0.000000 20.525000 30.000000 45.800000 100.000000
EOCSubjects_CACR_EDS 470.0 NaN NaN NaN 37.459149 20.836071 0.000000 24.725000 32.500000 45.275000 100.000000
GraduationRate_4yr_EDS 470.0 NaN NaN NaN 79.374894 26.773943 0.000000 79.700000 86.100000 92.175000 100.000000
GraduationRate_5yr_EDS 470.0 NaN NaN NaN 78.444043 28.150718 0.000000 80.025000 86.050000 92.100000 100.000000
EOCBiology_CACR_LEP 470.0 NaN NaN NaN 1.406809 5.558067 0.000000 0.000000 0.000000 0.000000 57.100000
EOCBiology_GLP_LEP 470.0 NaN NaN NaN 2.132979 6.971829 0.000000 0.000000 0.000000 0.000000 65.000000
EOCEnglish2_CACR_LEP 470.0 NaN NaN NaN 0.963617 4.180273 0.000000 0.000000 0.000000 0.000000 50.000000
EOCEnglish2_GLP_LEP 470.0 NaN NaN NaN 1.920213 5.924298 0.000000 0.000000 0.000000 0.000000 50.000000
EOCMathI_CACR_LEP 470.0 NaN NaN NaN 3.896383 9.188141 0.000000 0.000000 0.000000 0.000000 64.500000
EOCMathI_GLP_LEP 470.0 NaN NaN NaN 6.670638 13.059566 0.000000 0.000000 0.000000 10.000000 77.400000
EOCSubjects_CACR_LEP 470.0 NaN NaN NaN 4.432979 8.663636 0.000000 0.000000 0.000000 7.450000 59.400000
EOCSubjects_GLP_LEP 470.0 NaN NaN NaN 7.748936 11.919940 0.000000 0.000000 0.000000 13.125000 72.700000
EOG_EOCSubjects_GLP_LEP 470.0 NaN NaN NaN 8.149787 12.292782 0.000000 0.000000 0.000000 13.500000 72.700000
GraduationRate_4yr_LEP 470.0 NaN NaN NaN 7.784894 20.775824 0.000000 0.000000 0.000000 0.000000 100.000000
GraduationRate_5yr_LEP 470.0 NaN NaN NaN 7.372128 21.421025 0.000000 0.000000 0.000000 0.000000 100.000000
EOCBiology_CACR_SWD 470.0 NaN NaN NaN 9.497660 10.903168 0.000000 0.000000 7.100000 16.700000 68.400000
EOCBiology_GLP_SWD 470.0 NaN NaN NaN 13.921915 13.812176 0.000000 0.000000 11.950000 22.200000 76.200000
EOCEnglish2_CACR_SWD 470.0 NaN NaN NaN 7.504468 8.802193 0.000000 0.000000 5.900000 12.500000 52.200000
EOCEnglish2_GLP_SWD 470.0 NaN NaN NaN 12.009149 12.089260 0.000000 0.000000 10.600000 20.000000 73.900000
EOCMathI_CACR_SWD 470.0 NaN NaN NaN 6.443830 8.391197 0.000000 0.000000 0.000000 11.100000 50.000000
EOCMathI_GLP_SWD 470.0 NaN NaN NaN 11.833830 12.106117 0.000000 0.000000 10.300000 19.000000 65.000000
EOCSubjects_CACR_SWD 470.0 NaN NaN NaN 9.042979 10.657741 0.000000 0.000000 7.900000 14.175000 84.200000
EOCSubjects_GLP_SWD 470.0 NaN NaN NaN 14.457660 13.471427 0.000000 0.000000 13.850000 21.350000 84.200000
EOG_EOCSubjects_CACR_SWD 470.0 NaN NaN NaN 9.320426 10.834796 0.000000 0.000000 8.250000 14.300000 84.200000
EOG_EOCSubjects_GLP_SWD 470.0 NaN NaN NaN 14.861277 13.664786 0.000000 0.000000 14.300000 21.550000 84.200000
GraduationRate_4yr_SWD 470.0 NaN NaN NaN 50.764681 36.517810 0.000000 0.000000 66.700000 80.000000 100.000000
GraduationRate_5yr_SWD 470.0 NaN NaN NaN 51.971489 37.348827 0.000000 0.000000 69.800000 81.800000 100.000000
EOCBiology_CACR_AIG 470.0 NaN NaN NaN 69.184255 38.255702 0.000000 63.625000 87.950000 94.700000 100.000000
EOCEnglish2_CACR_AIG 470.0 NaN NaN NaN 72.812979 38.256516 0.000000 72.525000 91.250000 100.000000 100.000000
EOCMathI_CACR_AIG 470.0 NaN NaN NaN 72.531915 38.037694 0.000000 70.450000 90.900000 100.000000 100.000000
EOCSubjects_CACR_AIG 470.0 NaN NaN NaN 84.091915 24.625901 0.000000 84.200000 91.400000 100.000000 100.000000
GraduationRate_4yr_AIG 470.0 NaN NaN NaN 78.841277 40.301372 0.000000 94.300000 100.000000 100.000000 100.000000
GraduationRate_5yr_AIG 470.0 NaN NaN NaN 77.419574 41.121286 0.000000 92.450000 100.000000 100.000000 100.000000
CurrentYearEOC_pTarget_PctMet 470.0 NaN NaN NaN 0.953077 0.186088 0.000000 1.000000 1.000000 1.000000 1.000000
MathGr10_pTarget_PctMet 470.0 NaN NaN NaN 0.896879 0.276593 0.000000 1.000000 1.000000 1.000000 1.000000
ReadingGr10_pTarget_PctMet 470.0 NaN NaN NaN 0.906260 0.264579 0.000000 1.000000 1.000000 1.000000 1.000000
SciGr11_pTarget_PctMet 470.0 NaN NaN NaN 0.896534 0.280207 0.000000 1.000000 1.000000 1.000000 1.000000
TotalTargets_pTarget_PctMet 470.0 NaN NaN NaN 0.952555 0.120769 0.000000 0.958500 1.000000 1.000000 1.000000
sat_avg_score_num 470.0 NaN NaN NaN 943.529787 331.035054 0.000000 974.000000 1046.000000 1100.750000 1404.000000
lea_sat_avg_score_num 470.0 NaN NaN NaN 1052.346809 76.883797 0.000000 1024.000000 1061.000000 1095.000000 1231.000000
sat_participation_pct 470.0 NaN NaN NaN 0.377038 0.199265 0.000000 0.266000 0.384000 0.504750 0.931000
lea_sat_participation_pct 470.0 NaN NaN NaN 0.408089 0.122462 0.000000 0.322500 0.404500 0.496000 0.790000
ap_participation_pct 470.0 NaN NaN NaN 0.106553 0.109420 0.000000 0.000000 0.090000 0.150000 0.600000
lea_ap_participation_pct 470.0 NaN NaN NaN 0.125957 0.082179 0.000000 0.070000 0.120000 0.190000 0.440000
ap_pct_3_or_above 470.0 NaN NaN NaN 0.315957 0.254256 0.000000 0.000000 0.330000 0.500000 0.950000
lea_ap_pct_3_or_above 470.0 NaN NaN NaN 0.411043 0.160825 0.000000 0.320000 0.420000 0.530000 0.790000
total_specialized_courses 470.0 NaN NaN NaN 0.972532 0.126921 0.000000 0.997441 1.000000 1.000000 1.000000
cte_courses 470.0 NaN NaN NaN 0.641860 0.241591 0.000000 0.590802 0.714402 0.794978 1.000000
univ_college_courses 470.0 NaN NaN NaN 0.234590 0.335411 0.000000 0.022885 0.076502 0.218834 1.000000
lea_total_specialized_courses 470.0 NaN NaN NaN 0.976080 0.117655 0.000000 1.000000 1.000000 1.000000 1.000000
lea_cte_courses 470.0 NaN NaN NaN 0.699532 0.112084 0.000000 0.655961 0.714359 0.758595 0.899340
lea_univ_college_courses 470.0 NaN NaN NaN 0.120901 0.095261 0.000000 0.051443 0.108498 0.175459 0.608187
st_total_specialized_courses 470.0 NaN NaN NaN 0.987234 0.112383 0.000000 1.000000 1.000000 1.000000 1.000000
ALL_All_Students__Total_or_Subtotal_ENROLL_sch_pct 470.0 NaN NaN NaN 49.407119 16.665647 0.000000 43.469064 52.178615 59.198851 89.583333
ECODIS_Economically_Disadvantaged_ENROLL_sch_pct 470.0 NaN NaN NaN 13.818204 10.412653 0.000000 6.271930 14.124041 20.919193 60.416667
F_Female_ENROLL_sch_pct 470.0 NaN NaN NaN 28.120319 10.651209 0.000000 24.390244 28.897569 33.333333 66.666667
M_Male_ENROLL_sch_pct 470.0 NaN NaN NaN 21.286800 8.375260 0.000000 17.205551 22.307654 26.466698 48.387097
MB_Black_ENROLL_sch_pct 470.0 NaN NaN NaN 11.239856 12.062711 0.000000 0.000000 9.185877 17.182470 64.864865
MW_White_ENROLL_sch_pct 470.0 NaN NaN NaN 30.042317 17.241100 0.000000 17.293451 32.493479 43.663855 70.000000
avg_daily_attend_pct 470.0 NaN NaN NaN 0.943781 0.025043 0.845000 0.929000 0.943000 0.957000 1.000000
crime_per_c_num 470.0 NaN NaN NaN 0.989787 0.878575 0.000000 0.330000 0.860000 1.430000 5.090000
short_susp_per_c_num 470.0 NaN NaN NaN 15.674766 16.826811 0.000000 4.290000 10.740000 21.942500 125.310000
long_susp_per_c_num 470.0 NaN NaN NaN 0.096979 0.282454 0.000000 0.000000 0.000000 0.057500 3.050000
expelled_per_c_num 470.0 NaN NaN NaN 0.001468 0.013595 0.000000 0.000000 0.000000 0.000000 0.220000
stud_internet_comp_num 470.0 NaN NaN NaN 1.194681 1.628103 0.000000 0.720000 0.900000 1.277500 28.430000
lea_avg_daily_attend_pct 470.0 NaN NaN NaN 0.943560 0.007804 0.911000 0.938250 0.944000 0.948750 0.971000
lea_crime_per_c_num 470.0 NaN NaN NaN 1.185511 0.564448 0.000000 0.870000 1.180000 1.460000 3.790000
lea_short_susp_per_c_num 470.0 NaN NaN NaN 19.153681 13.463419 0.000000 10.480000 15.950000 23.180000 109.090000
lea_long_susp_per_c_num 470.0 NaN NaN NaN 0.113128 0.204563 0.000000 0.000000 0.040000 0.110000 1.140000
lea_expelled_per_c_num 470.0 NaN NaN NaN 0.002170 0.006721 0.000000 0.000000 0.000000 0.000000 0.050000
lea_stud_internet_comp_num 470.0 NaN NaN NaN 1.019468 0.270879 0.490000 0.840000 0.970000 1.140000 1.680000
st_avg_daily_attend_pct 470.0 NaN NaN NaN 0.946213 0.001130 0.946000 0.946000 0.946000 0.946000 0.954000
st_crime_per_c_num 470.0 NaN NaN NaN 1.210170 0.115332 0.530000 1.210000 1.210000 1.210000 1.670000
st_short_susp_per_c_num 470.0 NaN NaN NaN 18.225319 4.034035 9.090000 17.750000 17.750000 17.750000 42.140000
digital_media_pct 470.0 NaN NaN NaN 0.040340 0.123888 0.000000 0.000000 0.010000 0.030000 1.000000
avg_age_media_collection 470.0 NaN NaN NaN 1637.702128 770.349031 0.000000 1991.000000 1998.000000 2002.000000 2017.000000
books_per_student 470.0 NaN NaN NaN 12.625170 36.133162 0.000000 4.490000 9.485000 13.335000 717.000000
lea_books_per_student 470.0 NaN NaN NaN 16.763915 10.886062 0.000000 11.830000 18.170000 21.790000 84.900000
wap_num 470.0 NaN NaN NaN 63.693617 41.460282 0.000000 27.250000 64.500000 88.000000 224.000000
wap_per_classroom 470.0 NaN NaN NaN 1.318277 0.885091 0.000000 1.020000 1.230000 1.430000 13.000000
lea_wap_num 470.0 NaN NaN NaN 2364.170213 3116.116579 77.000000 604.000000 1206.000000 2630.750000 13584.000000
lea_wap_per_classroom 470.0 NaN NaN NaN 1.207468 0.274766 0.300000 1.100000 1.220000 1.340000 2.250000
flicensed_teach_pct 470.0 NaN NaN NaN 0.897266 0.109172 0.000000 0.864250 0.923000 0.964000 1.000000
tchyrs_0thru3_pct 470.0 NaN NaN NaN 0.205411 0.116004 0.000000 0.135250 0.195500 0.256750 0.833000
tchyrs_4thru10_pct 470.0 NaN NaN NaN 0.242268 0.105517 0.000000 0.188500 0.234000 0.288250 0.857000
tchyrs_11plus_pct 470.0 NaN NaN NaN 0.548079 0.133929 0.000000 0.481000 0.557000 0.632000 1.000000
nbpts_num 470.0 NaN NaN NaN 6.893617 6.542738 0.000000 2.000000 5.000000 10.000000 38.000000
advance_dgr_pct 470.0 NaN NaN NaN 0.263413 0.118239 0.000000 0.188000 0.250000 0.311000 0.800000
_1yr_tchr_trnovr_pct 470.0 NaN NaN NaN 0.137823 0.086867 0.000000 0.085500 0.136000 0.182000 0.667000
lateral_teach_pct 470.0 NaN NaN NaN 0.091779 0.087274 0.000000 0.031250 0.074000 0.129000 0.698000
lea_flicensed_teach_pct 470.0 NaN NaN NaN 0.893845 0.064147 0.577000 0.869000 0.902000 0.940000 1.000000
lea_tchyrs_0thru3_pct 470.0 NaN NaN NaN 0.209221 0.065852 0.059000 0.170000 0.196000 0.229000 0.533000
lea_tchyrs_4thru10_pct 470.0 NaN NaN NaN 0.237802 0.042907 0.056000 0.215000 0.242000 0.267000 0.400000
lea_tchyrs_11plus_pct 470.0 NaN NaN NaN 0.552845 0.068033 0.356000 0.505000 0.556000 0.591000 0.781000
lea_nbpts_num 470.0 NaN NaN NaN 6.908511 4.240898 0.000000 4.000000 6.000000 9.000000 22.000000
lea_advance_dgr_pct 470.0 NaN NaN NaN 0.247862 0.057492 0.077000 0.212250 0.250000 0.279000 0.500000
lea_1yr_tchr_trnovr_pct 470.0 NaN NaN NaN 0.147364 0.052679 0.000000 0.122000 0.143000 0.163000 0.365000
lea_emer_prov_teach_pct 470.0 NaN NaN NaN 0.002345 0.007012 0.000000 0.000000 0.000000 0.003000 0.130000
0_3_Years_LEA_Exp_Pct_Prin 470.0 NaN NaN NaN 0.442449 0.153451 0.143000 0.333000 0.385000 0.500000 1.000000
10__Years_LEA_Exp_Pct_Prin 470.0 NaN NaN NaN 0.152028 0.092879 0.000000 0.087000 0.153000 0.222000 0.667000
4_10_Years_LEA_Exp_Pct_Prin 470.0 NaN NaN NaN 0.405521 0.130555 0.000000 0.321000 0.440000 0.500000 0.857000
Accomplished_TCHR_Standard_1_Pct 470.0 NaN NaN NaN 0.511819 0.192084 0.000000 0.414250 0.518000 0.624250 1.000000
Accomplished_TCHR_Standard_2_Pct 470.0 NaN NaN NaN 0.502581 0.224969 0.000000 0.338500 0.516500 0.656750 1.000000
Accomplished_TCHR_Standard_3_Pct 470.0 NaN NaN NaN 0.475351 0.227565 0.000000 0.333000 0.466500 0.618000 1.000000
Accomplished_TCHR_Standard_4_Pct 470.0 NaN NaN NaN 0.554217 0.213229 0.000000 0.429000 0.571000 0.706500 1.000000
Accomplished_TCHR_Standard_5_Pct 470.0 NaN NaN NaN 0.418436 0.230837 0.000000 0.250000 0.400000 0.571000 1.000000
Developing_TCHR_Standard_1_Pct 470.0 NaN NaN NaN 0.008370 0.022764 0.000000 0.000000 0.000000 0.000000 0.211000
Developing_TCHR_Standard_2_Pct 470.0 NaN NaN NaN 0.013068 0.035224 0.000000 0.000000 0.000000 0.000000 0.368000
Developing_TCHR_Standard_3_Pct 470.0 NaN NaN NaN 0.012885 0.034490 0.000000 0.000000 0.000000 0.000000 0.368000
Developing_TCHR_Standard_4_Pct 470.0 NaN NaN NaN 0.011832 0.032688 0.000000 0.000000 0.000000 0.012000 0.368000
Developing_TCHR_Standard_5_Pct 470.0 NaN NaN NaN 0.013196 0.036151 0.000000 0.000000 0.000000 0.000000 0.333000
Distinguished_TCHR_Standard_1_Pct 470.0 NaN NaN NaN 0.144638 0.165040 0.000000 0.026000 0.089000 0.200000 0.938000
Distinguished_TCHR_Standard_2_Pct 470.0 NaN NaN NaN 0.106921 0.152714 0.000000 0.000000 0.047500 0.155500 1.000000
Distinguished_TCHR_Standard_3_Pct 470.0 NaN NaN NaN 0.100011 0.148204 0.000000 0.000000 0.042500 0.134250 1.000000
Distinguished_TCHR_Standard_4_Pct 470.0 NaN NaN NaN 0.102530 0.148019 0.000000 0.000000 0.043500 0.133000 1.000000
Distinguished_TCHR_Standard_5_Pct 470.0 NaN NaN NaN 0.090719 0.143691 0.000000 0.000000 0.036500 0.117750 1.000000
Does_Not_Meet_Expected_Growth_TCHR_Student_Growth_Pct 470.0 NaN NaN NaN 0.198009 0.164732 0.000000 0.072750 0.170500 0.291500 0.941000
Exceeds_Expected_Growth_TCHR_Student_Growth_Pct 470.0 NaN NaN NaN 0.264311 0.177140 0.000000 0.143000 0.250000 0.354000 1.000000
Meets_Expected_Growth_TCHR_Student_Growth_Pct 470.0 NaN NaN NaN 0.514247 0.173519 0.000000 0.438250 0.520500 0.614000 1.000000
Not_Demostrated_TCHR_Standard_1_Pct 470.0 NaN NaN NaN 0.000081 0.001056 0.000000 0.000000 0.000000 0.000000 0.018000
Not_Demostrated_TCHR_Standard_2_Pct 470.0 NaN NaN NaN 0.000247 0.002565 0.000000 0.000000 0.000000 0.000000 0.032000
Not_Demostrated_TCHR_Standard_3_Pct 470.0 NaN NaN NaN 0.000464 0.003931 0.000000 0.000000 0.000000 0.000000 0.050000
Not_Demostrated_TCHR_Standard_4_Pct 470.0 NaN NaN NaN 0.000406 0.003407 0.000000 0.000000 0.000000 0.000000 0.043000
Not_Demostrated_TCHR_Standard_5_Pct 470.0 NaN NaN NaN 0.000240 0.002628 0.000000 0.000000 0.000000 0.000000 0.032000
Proficient_TCHR_Standard_1_Pct 470.0 NaN NaN NaN 0.328757 0.218675 0.000000 0.163250 0.294000 0.478750 1.000000
Proficient_TCHR_Standard_2_Pct 470.0 NaN NaN NaN 0.368694 0.248084 0.000000 0.172250 0.339500 0.549750 1.000000
Proficient_TCHR_Standard_3_Pct 470.0 NaN NaN NaN 0.402806 0.250892 0.000000 0.205000 0.409000 0.576750 1.000000
Proficient_TCHR_Standard_4_Pct 470.0 NaN NaN NaN 0.324653 0.230778 0.000000 0.143500 0.282000 0.475750 1.000000
Proficient_TCHR_Standard_5_Pct 470.0 NaN NaN NaN 0.468932 0.261871 0.000000 0.258250 0.500000 0.667000 1.000000
AsianFemalePct 470.0 NaN NaN NaN 0.011874 0.019683 0.000000 0.001496 0.005321 0.014185 0.175879
AsianMalePct 470.0 NaN NaN NaN 0.011414 0.020989 0.000000 0.001329 0.005306 0.012473 0.251256
BlackFemalePct 470.0 NaN NaN NaN 0.129011 0.120363 0.000000 0.032610 0.101658 0.193336 0.861386
BlackMalePct 470.0 NaN NaN NaN 0.118711 0.115140 0.000000 0.033192 0.086045 0.173430 0.910569
BlackPct 470.0 NaN NaN NaN 0.247722 0.218562 0.000000 0.072029 0.197622 0.377426 0.953368
HispanicFemalePct 470.0 NaN NaN NaN 0.075055 0.056356 0.000000 0.035671 0.058095 0.100709 0.313653
HispanicMalePct 470.0 NaN NaN NaN 0.069789 0.051337 0.000000 0.032990 0.055761 0.092265 0.293413
HispanicPct 470.0 NaN NaN NaN 0.144844 0.101342 0.000000 0.071533 0.119806 0.190982 0.572455
IndianFemalePct 470.0 NaN NaN NaN 0.007351 0.031246 0.000000 0.000000 0.001192 0.003320 0.410076
MinorityFemalePct 470.0 NaN NaN NaN 0.242644 0.141006 0.000000 0.129955 0.223682 0.337701 0.970297
MinorityMalePct 470.0 NaN NaN NaN 0.223486 0.134714 0.000000 0.121962 0.198011 0.307937 0.983740
MinorityPct 470.0 NaN NaN NaN 0.466131 0.247522 0.037234 0.261599 0.424047 0.660880 0.993377
PacificIslandFemalePct 470.0 NaN NaN NaN 0.000673 0.001585 0.000000 0.000000 0.000000 0.000806 0.014925
PacificIslandMalePct 470.0 NaN NaN NaN 0.000495 0.001351 0.000000 0.000000 0.000000 0.000442 0.015625
PacificIslandPct 470.0 NaN NaN NaN 0.001168 0.002286 0.000000 0.000000 0.000000 0.001391 0.015625
TwoOrMoreFemalePct 470.0 NaN NaN NaN 0.018680 0.011561 0.000000 0.011276 0.017000 0.023824 0.083333
TwoOrMoreMalePct 470.0 NaN NaN NaN 0.016422 0.011204 0.000000 0.009342 0.015217 0.022076 0.125000
TwoOrMorePct 470.0 NaN NaN NaN 0.035102 0.018873 0.000000 0.022892 0.032004 0.044748 0.125000
Gr_9_Pct_Prof 470.0 NaN NaN NaN 38.441915 20.842859 0.000000 24.900000 35.150000 50.725000 100.000000
pct_eds 470.0 NaN NaN NaN 45.679787 16.926037 0.000000 34.725000 45.700000 57.100000 100.000000
AAVC_Concentrator_Ct 470.0 NaN NaN NaN 12.089362 28.321421 0.000000 0.000000 0.000000 12.000000 319.000000
AGNR_Concentrator_Ct 470.0 NaN NaN NaN 42.148936 65.731501 0.000000 0.000000 22.000000 57.000000 596.000000
ARCH_Concentrator_Ct 470.0 NaN NaN NaN 21.478723 54.884241 0.000000 0.000000 8.000000 27.000000 759.000000
BMA_Concentrator_Ct 470.0 NaN NaN NaN 20.461702 54.193895 0.000000 0.000000 6.000000 22.000000 832.000000
HLTH_Concentrator_Ct 470.0 NaN NaN NaN 29.227660 62.371995 0.000000 0.000000 14.000000 30.000000 617.000000
HOSP_Concentrator_Ct 470.0 NaN NaN NaN 32.617021 64.259984 0.000000 0.000000 15.000000 38.000000 795.000000
INFO_Concentrator_Ct 470.0 NaN NaN NaN 19.106383 43.775710 0.000000 0.000000 6.500000 20.000000 500.000000
MANU_Concentrator_Ct 470.0 NaN NaN NaN 9.331915 29.498445 0.000000 0.000000 1.000000 8.000000 446.000000
MRKT_Concentrator_Ct 470.0 NaN NaN NaN 10.459574 48.715630 0.000000 0.000000 0.000000 6.000000 832.000000
STEM_Concentrator_Ct 470.0 NaN NaN NaN 14.604255 53.395703 0.000000 0.000000 0.000000 9.750000 530.000000
TRAN_Concentrator_Ct 470.0 NaN NaN NaN 7.655319 20.773536 0.000000 0.000000 0.000000 7.000000 253.000000
Number_Industry_Recognized_Crede 470.0 NaN NaN NaN 301.419149 288.855828 0.000000 35.500000 250.500000 444.500000 1674.000000
grade_range_cd_11_12 470.0 NaN NaN NaN 0.012766 0.112383 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_11_13 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_3_12 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_6_12 470.0 NaN NaN NaN 0.012766 0.112383 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_6_13 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_7_12 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_7_13 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_8_12 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_9_11 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_9_12 470.0 NaN NaN NaN 0.768085 0.422505 0.000000 1.000000 1.000000 1.000000 1.000000
grade_range_cd_9_13 470.0 NaN NaN NaN 0.140426 0.347798 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_9_9 470.0 NaN NaN NaN 0.014894 0.121256 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_K_12 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_PK_12 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
grade_range_cd_PK_13 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
calendar_type_txt_Regular_School__Year_Round_Calendar 470.0 NaN NaN NaN 0.042553 0.202063 0.000000 0.000000 0.000000 0.000000 1.000000
esea_status_P 470.0 NaN NaN NaN 0.034043 0.181532 0.000000 0.000000 0.000000 0.000000 1.000000
Grad_project_status_Y 470.0 NaN NaN NaN 0.338298 0.473635 0.000000 0.000000 0.000000 1.000000 1.000000
SBE_District_Northeast 470.0 NaN NaN NaN 0.091489 0.288611 0.000000 0.000000 0.000000 0.000000 1.000000
SBE_District_Northwest 470.0 NaN NaN NaN 0.082979 0.276144 0.000000 0.000000 0.000000 0.000000 1.000000
SBE_District_Piedmont_Triad 470.0 NaN NaN NaN 0.155319 0.362595 0.000000 0.000000 0.000000 0.000000 1.000000
SBE_District_Sandhills 470.0 NaN NaN NaN 0.100000 0.300320 0.000000 0.000000 0.000000 0.000000 1.000000
SBE_District_Southeast 470.0 NaN NaN NaN 0.106383 0.308656 0.000000 0.000000 0.000000 0.000000 1.000000
SBE_District_Southwest 470.0 NaN NaN NaN 0.185106 0.388798 0.000000 0.000000 0.000000 0.000000 1.000000
SBE_District_Western 470.0 NaN NaN NaN 0.091489 0.288611 0.000000 0.000000 0.000000 0.000000 1.000000
SPG_Grade_A_NG 470.0 NaN NaN NaN 0.055319 0.228846 0.000000 0.000000 0.000000 0.000000 1.000000
SPG_Grade_B 470.0 NaN NaN NaN 0.370213 0.483376 0.000000 0.000000 0.000000 1.000000 1.000000
SPG_Grade_C 470.0 NaN NaN NaN 0.346809 0.476461 0.000000 0.000000 0.000000 1.000000 1.000000
SPG_Grade_D 470.0 NaN NaN NaN 0.076596 0.266232 0.000000 0.000000 0.000000 0.000000 1.000000
Reading_SPG_Grade_B 470.0 NaN NaN NaN 0.010638 0.102701 0.000000 0.000000 0.000000 0.000000 1.000000
Reading_SPG_Grade_C 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
Reading_SPG_Grade_D 470.0 NaN NaN NaN 0.006383 0.079723 0.000000 0.000000 0.000000 0.000000 1.000000
Reading_SPG_Grade_F 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
Math_SPG_Grade_B 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
Math_SPG_Grade_C 470.0 NaN NaN NaN 0.006383 0.079723 0.000000 0.000000 0.000000 0.000000 1.000000
Math_SPG_Grade_D 470.0 NaN NaN NaN 0.006383 0.079723 0.000000 0.000000 0.000000 0.000000 1.000000
Math_SPG_Grade_F 470.0 NaN NaN NaN 0.010638 0.102701 0.000000 0.000000 0.000000 0.000000 1.000000
EVAAS_Growth_Status_Met 470.0 NaN NaN NaN 0.357447 0.479759 0.000000 0.000000 0.000000 1.000000 1.000000
EVAAS_Growth_Status_NotMet 470.0 NaN NaN NaN 0.293617 0.455904 0.000000 0.000000 0.000000 1.000000 1.000000
State_Gap_Compared_Y 470.0 NaN NaN NaN 0.065957 0.248472 0.000000 0.000000 0.000000 0.000000 1.000000
Byod_Yes 470.0 NaN NaN NaN 0.444681 0.497460 0.000000 0.000000 0.000000 1.000000 1.000000
grades_BYOD_11_12 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_11_12_13 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_12 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_6_7_8_9_10_11_12 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_6_7_8_9_10_11_12_13 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_8_9_10_11_12 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_9 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_9_10_11 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_9_10_11_12 470.0 NaN NaN NaN 0.317021 0.465812 0.000000 0.000000 0.000000 1.000000 1.000000
grades_BYOD_9_10_11_12_13 470.0 NaN NaN NaN 0.080851 0.272897 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_9_11_12 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_BYOD_PK_9_10_11_12 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
_1_to_1_access_Yes 470.0 NaN NaN NaN 0.506383 0.500492 0.000000 0.000000 1.000000 1.000000 1.000000
grades_1_to_1_access_10_11_12 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_10_11_12_13 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_11 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_11_12 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_11_12_13 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_6_07_08 470.0 NaN NaN NaN 0.006383 0.079723 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_6_7_8_9_10_11_12 470.0 NaN NaN NaN 0.008511 0.091958 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_6_7_8_9_10_11_12_13 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_9 470.0 NaN NaN NaN 0.027660 0.164170 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_9_10 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_9_10_11 470.0 NaN NaN NaN 0.025532 0.157902 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_9_10_11_12 470.0 NaN NaN NaN 0.325532 0.469073 0.000000 0.000000 0.000000 1.000000 1.000000
grades_1_to_1_access_9_10_11_12_13 470.0 NaN NaN NaN 0.078723 0.269594 0.000000 0.000000 0.000000 0.000000 1.000000
grades_1_to_1_access_9_11_12_13 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_devices_sent_home_Yes 470.0 NaN NaN NaN 0.414894 0.493229 0.000000 0.000000 0.000000 1.000000 1.000000
SRC_Grades_Devices_Sent_Home_10_11_12 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_10_11_12_13 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_6_07_08 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_6_7_8_9_10_11_12 470.0 NaN NaN NaN 0.004255 0.065163 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_6_7_8_9_10_11_12_13 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_8_9_10_11_12_13 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_9_10 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_9_10_11 470.0 NaN NaN NaN 0.017021 0.129488 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_9_10_11_12 470.0 NaN NaN NaN 0.263830 0.441178 0.000000 0.000000 0.000000 1.000000 1.000000
SRC_Grades_Devices_Sent_Home_9_10_11_12_13 470.0 NaN NaN NaN 0.078723 0.269594 0.000000 0.000000 0.000000 0.000000 1.000000
SRC_Grades_Devices_Sent_Home_9_10_12 470.0 NaN NaN NaN 0.002128 0.046127 0.000000 0.000000 0.000000 0.000000 1.000000
unit_code 470.0 416.0 207.0 5.0 NaN NaN NaN NaN NaN NaN NaN
ACT_Score 470.0 NaN NaN NaN 59.182979 22.639111 0.000000 46.000000 59.000000 73.000000 100.000000

Here is a description of each variable in our dataset. The data table column refers to the raw csv file where it can be found. Please refer to the README file from Dr. Drew for URLs to the data sources: https://github.com/jakemdrew/EducationDataNC/blob/master/2017/Raw%20Datasets/README.md

Column Name Description Data Table
student_num Number of students at school level (school size) Profile
lea_avg_student_num Average school size within the LEA Profile
st_avg_student_num Average school size within the state Profile
09_Size Average grade size for school Profile Metric
10_Size Average grade size for school Profile Metric
11_Size Average grade size for school Profile Metric
12_Size Average grade size for school Profile Metric
Biology_Size Average class size for school Profile Metric
English_II_Size Average class size for school Profile Metric
Math_I_Size Average class size for school Profile Metric
lea_total_expense_num Total expense (Dollars Spent) at LEA level Funding
lea_salary_expense_pct Percent of expense spent on Salaries at LEA level Funding
lea_services_expense_pct Percent of expense spent on Services at LEA level Funding
lea_supplies_expense_pct Percent of expense spent on Supplies at LEA level Funding
lea_instruct_equip_exp_pct Percent of expense spent on Instructional Equipment at LEA level Funding
lea_federal_perpupil_num Federal expense per pupil at LEA level Funding
lea_local_perpupil_num Local expense per pupil at school level Funding
lea_state_perpupil_num State expense per pupil at LEA level Funding
SPG_Score School Performance Grade score SPG
EVAAS_Growth_Score Education Value-Added Assessment System Growth score SPG
NC_Math_1_Score Average score for NC Math 1 SPG
English_II_Score Average score for English II SPG
Biology_Score Average score for Biology SPG
Passing_NC_Math_3 Average score for Passing NC Math 3 SPG
4_Year_Cohort_Graduation_Rate_Score Average score for 4 year cohort graduation rate SPG
EOCSubjects_CACR_All Average percent by school for All End of Course Subjects by College and Career Ready standard READY (acctdrilldwn)
GraduationRate_5yr_All Graduation Rate for 5 year (Extended) READY (acctdrilldwn)
EOCBiology_CACR_Female EOC Biology Score by School/CACR - Female READY (acctdrilldwn)
EOCEnglish2_CACR_Female EOC English 2 Score by School/CACR - Female READY (acctdrilldwn)
GraduationRate_4yr_Female Female Graduation Rate for 4 year (Standard) READY (acctdrilldwn)
GraduationRate_5yr_Female Female Graduation Rate for 5 year (Extended) READY (acctdrilldwn)
EOCBiology_CACR_Male EOC Biology Score by School/CACR - Male READY (acctdrilldwn)
EOCEnglish2_CACR_Male EOC English 2 Score by School/CACR - Male READY (acctdrilldwn)
EOCMathI_CACR_Male EOC Math 1 Score by School/CACR - Male READY (acctdrilldwn)
GraduationRate_4yr_Male Male Graduation Rate for 4 year READY (acctdrilldwn)
GraduationRate_5yr_Male Male Gradation Rate for 5 year READY (acctdrilldwn)
EOCBiology_CACR_AmericanIndian EOC Biology Score by School/CACR -Ethnicity READY (acctdrilldwn)
EOCEnglish2_CACR_AmericanIndian EOC English 2 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCMathI_CACR_AmericanIndian EOC Math 1 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCSubjects_CACR_AmericanIndian EOC Subjects by School/CACR - Ethnicity READY (acctdrilldwn)
GraduationRate_4yr_AmericanIndian Graduation Rate for 4 year - Ethnicity READY (acctdrilldwn)
GraduationRate_5yr_AmericanIndian Graduation Rate for 5 year - Ethnicity READY (acctdrilldwn)
EOCBiology_CACR_Asian EOC Biology Score by School/CACR -Ethnicity READY (acctdrilldwn)
EOCEnglish2_CACR_Asian EOC English 2 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCMathI_CACR_Asian EOC Math1 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCSubjects_CACR_Asian EOC Subjects by School/CACR - Ethnicity READY (acctdrilldwn)
GraduationRate_4yr_Asian Graduation Rate for 4 year - Ethnicity READY (acctdrilldwn)
GraduationRate_5yr_Asian Graduation Rate for 5 year - Ethnicity READY (acctdrilldwn)
EOCBiology_CACR_Black EOC Biology Score by School/CACR -Ethnicity READY (acctdrilldwn)
EOCEnglish2_CACR_Black EOC English 2 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCMathI_CACR_Black EOC Math 1 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCSubjects_CACR_Black EOC Subjects by School/CACR - Ethnicity READY (acctdrilldwn)
GraduationRate_4yr_Black Graduation Rate for 4 year - Ethnicity READY (acctdrilldwn)
GraduationRate_5yr_Black Graduation Rate for 5 year - Ethnicity READY (acctdrilldwn)
EOCBiology_CACR_Hispanic EOC Biology Score by School/CACR -Ethnicity READY (acctdrilldwn)
EOCEnglish2_CACR_Hispanic EOC English 2 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCMathI_CACR_Hispanic EOC Math 1 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCSubjects_CACR_Hispanic EOC Subjects by School/CACR - Ethnicity READY (acctdrilldwn)
GraduationRate_4yr_Hispanic Graduation Rate for 4 year - Ethnicity READY (acctdrilldwn)
GraduationRate_5yr_Hispanic Graduation Rate for 5 year - Ethnicity READY (acctdrilldwn)
EOCBiology_CACR_TwoorMoreRaces EOC Biology Score by School/CACR -Ethnicity READY (acctdrilldwn)
EOCEnglish2_CACR_TwoorMoreRaces EOC English 2 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCMathI_CACR_TwoorMoreRaces EOC Math 1 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCSubjects_CACR_TwoorMoreRaces EOC Subjects by School/CACR - Ethnicity READY (acctdrilldwn)
GraduationRate_4yr_TwoorMoreRaces Graduation Rate for 4 year - Ethnicity READY (acctdrilldwn)
GraduationRate_5yr_TwoorMoreRaces Graduation Rate for 5 year - Ethnicity READY (acctdrilldwn)
EOCBiology_CACR_White EOC Biology Score by School/CACR -Ethnicity READY (acctdrilldwn)
EOCEnglish2_CACR_White EOC English 2 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCMathI_CACR_White EOC Math 1 Score by School/CACR - Ethnicity READY (acctdrilldwn)
EOCSubjects_CACR_White EOC Subjects by School/CACR - Ethnicity READY (acctdrilldwn)
GraduationRate_4yr_White Graduation Rate for 4 year - Ethnicity READY (acctdrilldwn)
GraduationRate_5yr_White Graduation Rate for 5 year - Ethnicity READY (acctdrilldwn)
EOCBiology_CACR_EDS EOC Biology Score by School/CACR - Economically Disadvantaged READY (acctdrilldwn)
EOCEnglish2_CACR_EDS EOC English 2 Score by School/CACR - Economically Disadvantaged READY (acctdrilldwn)
EOCMathI_CACR_EDS EOC Math 1 Score by School/CACR - Economically Disadvantaged READY (acctdrilldwn)
EOCSubjects_CACR_EDS EOC Subjects by School/CACR - Economically Disadvantaged READY (acctdrilldwn)
GraduationRate_4yr_EDS Graduation Rate for 4 year - Economically Disadvantaged READY (acctdrilldwn)
GraduationRate_5yr_EDS Graduation Rate for 5 Year - Economically Disadvantaged READY (acctdrilldwn)
EOCBiology_CACR_LEP EOC Biology Score by School/CACR - Limited English Proficient READY (acctdrilldwn)
EOCBiology_GLP_LEP EOC Biology Score by School/Grade Level Proficient (GLP) - Limited English Proficiency READY (acctdrilldwn)
EOCEnglish2_CACR_LEP EOC English 2 by School/CACR - Limited English Proficiency READY (acctdrilldwn)
EOCEnglish2_GLP_LEP EOC English 2 by School/GLP - Limited English Proficiency READY (acctdrilldwn)
EOCMathI_CACR_LEP EOC Math 1 by School/CACR - Limited English Proficiency READY (acctdrilldwn)
EOCMathI_GLP_LEP EOC Math 1 by School/GLP - Limited English Proficiency READY (acctdrilldwn)
EOCSubjects_CACR_LEP EOC Subjects by School/CACR - Limited English Proficiency READY (acctdrilldwn)
EOCSubjects_GLP_LEP EOC Subjects by School/GLP - Limited English Proficiency READY (acctdrilldwn)
EOG_EOCSubjects_GLP_LEP End of Grade EOC Subjects by School/GLP - Limited English Proficiency READY (acctdrilldwn)
GraduationRate_4yr_LEP Graduation Rate for 4 Year - Limited English Proficiency READY (acctdrilldwn)
GraduationRate_5yr_LEP Graduation Rate for 5 Year - Limited English Proficiency READY (acctdrilldwn)
EOCBiology_CACR_SWD EOC Biology Score by School/CACR - Students with Disabilities READY (acctdrilldwn)
EOCBiology_GLP_SWD EOC Biology Score by School/GLP - Students with Disabilities READY (acctdrilldwn)
EOCEnglish2_CACR_SWD EOC English 2 Score by School/CACR - Students with Disabilities READY (acctdrilldwn)
EOCEnglish2_GLP_SWD EOC English 2 Score by School/GLP - Students with Disabilities READY (acctdrilldwn)
EOCMathI_CACR_SWD EOC Math 1 Score by School/CACR - Students with Disabilities READY (acctdrilldwn)
EOCMathI_GLP_SWD EOC Math 1 Score by School/GLP - Students with Disabilities READY (acctdrilldwn)
EOCSubjects_CACR_SWD EOC Subjects Score by School/CACR - Students with Disabilities READY (acctdrilldwn)
EOCSubjects_GLP_SWD EOC Subjects Score by School/GLP - Students with Disabilities READY (acctdrilldwn)
EOG_EOCSubjects_CACR_SWD End of Grade EOC Subjects by School/CACR - Students with Disabilities READY (acctdrilldwn)
EOG_EOCSubjects_GLP_SWD End of Grade EOC Subjects by School/GLP - Students with Disabilities READY (acctdrilldwn)
GraduationRate_4yr_SWD Graduation Rate for 4 Year - Students with Disabilities READY (acctdrilldwn)
GraduationRate_5yr_SWD Graduation Rate for 5 Year - Students With Disabilities READY (acctdrilldwn)
EOCBiology_CACR_AIG EOC Biology Score by School/CACR - Academically or Intellectually Gifted READY (acctdrilldwn)
EOCEnglish2_CACR_AIG EOC English 2 Score by School/CACR - Academically or Intellectually Gifted READY (acctdrilldwn)
EOCMathI_CACR_AIG EOC Math 1 Score by School/CACR - Academically or Intellectually Gifted READY (acctdrilldwn)
EOCSubjects_CACR_AIG EOC Subjects by School/CACR- Academically or Intellectually Gifted READY (acctdrilldwn)
GraduationRate_4yr_AIG Graduation Rate for 4 Year - Academically or Intellectually Gifted READY (acctdrilldwn)
GraduationRate_5yr_AIG Grauduation Rate for 5 Year - Academically or Intellectually Gifted READY (acctdrilldwn)
CurrentYearEOC_pTarget_PctMet Percentage of participation target met for Current Year EOC Participation Targets Overall
MathGr10_pTarget_PctMet Percentage of participation target met for Grade 10 Math Participation Targets Overall
ReadingGr10_pTarget_PctMet Percentage of participation target met for Grade 10 Reading Participation Targets Overall
SciGr11_pTarget_PctMet Percentage of participation target met for Grade 11 Science Participation Targets Overall
TotalTargets_pTarget_PctMet Percentage of participation target met for Total Targets Participation Targets Overall
sat_avg_score_num Average SAT Score (Critical Reading plus Math) at the School Level School Indicators
lea_sat_avg_score_num Average SAT Score (Critical Reading + Math) at the LEA level School Indicators
sat_participation_pct Percentage of High School Seniors taking the SAT at the School Level School Indicators
lea_sat_participation_pct Percentage of High School Seniors taking the SAT at the LEA level School Indicators
ap_participation_pct Percentage of High School Students taking an AP exam at the School Level School Indicators
lea_ap_participation_pct Pecentage of High School Students taking an AP exam at the LEA Level School Indicators
ap_pct_3_or_above Percentage of AP Exams with Scores of 3 or Above at the School Level School Indicators
lea_ap_pct_3_or_above Percentage of AP Exams with Scores of 3 or Above at the LEA Level School Indicators
total_specialized_courses Percent of students enrolled in at least one specialized course (CTE, AP/IB, Community College or University academic course) at the school level Specializied Course Enrollment
cte_courses Percent of students enrolled in at least one Career and Technical Education (CTE) course at the school level Specializied Course Enrollment
univ_college_courses Percent of students enrolled in at least one academic course at a community college or university at the school level Specializied Course Enrollment
lea_total_specialized_courses Percent of students enrolled in at least one specialized course (CTE, AP/IB, Community College or University academic course) at the LEA level Specializied Course Enrollment
lea_cte_courses Percent of students enrolled in at least one Career and Technical Education (CTE) course at the LEA level Specializied Course Enrollment
lea_univ_college_courses Percent of students enrolled in at least one academic course at a community college or university at the LEA level Specializied Course Enrollment
st_total_specialized_courses Percent of students enrolled in at least one specialized course (CTE, AP/IB, Community College or University academic course) at the state level Specializied Course Enrollment
ALL_All_Students__Total_or_Subtotal_ENROLL_sch_pct The percentage of students enrolled in college by school College Enrollment
ECODIS_Economically_Disadvantaged_ENROLL_sch_pct The percentage of economically disadvantaged students enrolled in college by school College Enrollment
F_Female_ENROLL_sch_pct The percentage of female students enrolled in college by school College Enrollment
M_Male_ENROLL_sch_pct The percentage of male students enrolled in college by school College Enrollment
MB_Black_ENROLL_sch_pct The percentage of black male students enrolled in college by school College Enrollment
MW_White_ENROLL_sch_pct The percentage of white male students enrolled in college by school College Enrollment
avg_daily_attend_pct Average daily attendance percentage at school level Environment
crime_per_c_num Number of crimes or acts of violence per 100 students at School level Environment
short_susp_per_c_num Short term suspensions per 100 students at school level Environment
long_susp_per_c_num Long term suspensions per 100 students at school level Environment
expelled_per_c_num Expulsions per 100 students at school level Environment
stud_internet_comp_num Ratio of students to internet connected computer at school level Environment
lea_avg_daily_attend_pct Average daily attendance percentage at LEA level Environment
lea_crime_per_c_num Number of crimes or acts of violence per 100 students at LEA level Environment
lea_short_susp_per_c_num Short term suspensions per 100 students at LEA level Environment
lea_long_susp_per_c_num Long term suspensions per 100 students at LEA level Environment
lea_expelled_per_c_num Expulsions per 100 students at LEA level Environment
lea_stud_internet_comp_num Ratio of students to internet connected at LEA level Environment
st_avg_daily_attend_pct Average daily attendance percentage at State level Environment
st_crime_per_c_num Number of crimes or acts of violence per 100 students at State level Environment
st_short_susp_per_c_num Short term supensions per 100 students at State level Environment
digital_media_pct Percentage of digital media by school Environment
avg_age_media_collection Average age of media collection by school Environment
books_per_student Average number of books per student by school Environment
lea_books_per_student Average number of books per student at LEA level Environment
wap_num Average number of students per instructional device at the school level Environment
wap_per_classroom Average number of instructional devices per classroom at the school level Environment
lea_wap_num Average number of students per instructional device at the LEA level Environment
lea_wap_per_classroom Average number of instructional devices per classroom at the LEA level Environment
flicensed_teach_pct Percent of teachers that meet NC fully licensed defintion at school level Personnel
tchyrs_0thru3_pct Percentage of teachers with 0 to 3 years of experience Personnel
tchyrs_4thru10_pct Percentage of teachers with 4 to 10 years of experience Personnel
tchyrs_11plus_pct Percentage of teachers with 11+ years of experience Personnel
nbpts_num Number of National Board Certified Staff at school level Personnel
advance_dgr_pct Percent of teachers with masters or higher degree at school level Personnel
_1yr_tchr_trnovr_pct One Year Teacher turnover percentage at school level Personnel
lateral_teach_pct Lateral entry tacherpercentage at school level Personnel
lea_flicensed_teach_pct Average Percent of Teachers that meet NC fully licensed definition at LEA level Personnel
lea_tchyrs_0thru3_pct Percentage of teachers with 0 to 3 years of experience at the LEA level Personnel
lea_tchyrs_4thru10_pct Percentage of teachers with 4 to 10 years of experience at the LEA level Personnel
lea_tchyrs_11plus_pct Percentage of teachers with 11+ years of experience at the LEA level Personnel
lea_nbpts_num Average number of National Board Certified staff at LEA level Personnel
lea_advance_dgr_pct Average percent of teachers with masters or higher degree at LEA level Personnel
lea_1yr_tchr_trnovr_pct One Year Teacher turnover percentage at LEA level Personnel
lea_emer_prov_teach_pct Personnel
0_3_Years_LEA_Exp_Pct_Prin Percentage of principals with 0-3 years of experience at the LEA level Educator Effectiveness
10__Years_LEA_Exp_Pct_Prin Percentage of principals with 10+ years of experience at the LEA level Educator Effectiveness
4_10_Years_LEA_Exp_Pct_Prin Percentage of principals with 4-10 years of experience at the LEA level Educator Effectiveness
Accomplished_TCHR_Standard_1_Pct Percentage of accomplished level teachers that met standard level 1 Educator Effectiveness
Accomplished_TCHR_Standard_2_Pct Percentage of accomplished level teachers that met standard level 2 Educator Effectiveness
Accomplished_TCHR_Standard_3_Pct Percentage of accomplished level teachers that met standard level 3 Educator Effectiveness
Accomplished_TCHR_Standard_4_Pct Percentage of accomplished level teachers that met standard level 4 Educator Effectiveness
Accomplished_TCHR_Standard_5_Pct Percentage of accomplished level teachers that met standard level 5 Educator Effectiveness
Developing_TCHR_Standard_1_Pct Percentage of developing level teachers that met standard level 1 Educator Effectiveness
Developing_TCHR_Standard_2_Pct Percentage of developing level teachers that met standard level 2 Educator Effectiveness
Developing_TCHR_Standard_3_Pct Percentage of developing level teachers that met standard level 3 Educator Effectiveness
Developing_TCHR_Standard_4_Pct Percentage of developing level teachers that met standard level 4 Educator Effectiveness
Developing_TCHR_Standard_5_Pct Percentage of developing level teachers that met standard level 5 Educator Effectiveness
Distinguished_TCHR_Standard_1_Pct Percentage of distinguished level teachers that met standard level 1 Educator Effectiveness
Distinguished_TCHR_Standard_2_Pct Percentage of distinguished level teachers that met standard level 2 Educator Effectiveness
Distinguished_TCHR_Standard_3_Pct Percentage of distinguished level teachers that met standard level 3 Educator Effectiveness
Distinguished_TCHR_Standard_4_Pct Percentage of distinguished level teachers that met standard level 4 Educator Effectiveness
Distinguished_TCHR_Standard_5_Pct Percentage of distinguished level teachers that met standard level 5 Educator Effectiveness
Does_Not_Meet_Expected_Growth_TCHR_Student_Growth_Pct Student growth percentage for Does Not Meet Expected Growth teachers Educator Effectiveness
Exceeds_Expected_Growth_TCHR_Student_Growth_Pct Student growth percentage for Exeeds Expected Growth teachers Educator Effectiveness
Meets_Expected_Growth_TCHR_Student_Growth_Pct Student growth percentage for Meets Expected Growth teachers Educator Effectiveness
Not_Demostrated_TCHR_Standard_1_Pct Percentage of not demonstrated level teachers that met standard level 1 Educator Effectiveness
Not_Demostrated_TCHR_Standard_2_Pct Percentage of not demonstrated level teachers that met standard level 2 Educator Effectiveness
Not_Demostrated_TCHR_Standard_3_Pct Percentage of not demonstrated level teachers that met standard level 3 Educator Effectiveness
Not_Demostrated_TCHR_Standard_4_Pct Percentage of not demonstrated level teachers that met standard level 4 Educator Effectiveness
Not_Demostrated_TCHR_Standard_5_Pct Percentage of not demonstrated level teachers that met standard level 5 Educator Effectiveness
Proficient_TCHR_Standard_1_Pct Percentage of proficient level teachers that met standard level 1 Educator Effectiveness
Proficient_TCHR_Standard_2_Pct Percentage of proficient level teachers that met standard level 2 Educator Effectiveness
Proficient_TCHR_Standard_3_Pct Percentage of proficient level teachers that met standard level 3 Educator Effectiveness
Proficient_TCHR_Standard_4_Pct Percentage of proficient level teachers that met standard level 4 Educator Effectiveness
Proficient_TCHR_Standard_5_Pct Percentage of proficient level teachers that met standard level 5 Educator Effectiveness
AsianFemalePct Percentage of specified group Statistical Profile
AsianMalePct Percentage of specified group Statistical Profile
BlackFemalePct Percentage of specified group Statistical Profile
BlackMalePct Percentage of specified group Statistical Profile
BlackPct Percentage of specified group Statistical Profile
HispanicFemalePct Percentage of specified group Statistical Profile
HispanicMalePct Percentage of specified group Statistical Profile
HispanicPct Percentage of specified group Statistical Profile
IndianFemalePct Percentage of specified group Statistical Profile
MinorityFemalePct Percentage of specified group Statistical Profile
MinorityMalePct Percentage of specified group Statistical Profile
MinorityPct Percentage of specified group Statistical Profile
PacificIslandFemalePct Percentage of specified group Statistical Profile
PacificIslandMalePct Percentage of specified group Statistical Profile
PacificIslandPct Percentage of specified group Statistical Profile
TwoOrMoreFemalePct Percentage of specified group Statistical Profile
TwoOrMoreMalePct Percentage of specified group Statistical Profile
TwoOrMorePct Percentage of specified group Statistical Profile
Gr_9_Pct_Prof Percentage of proficiency in grade 9 Statistical Profile
pct_eds Percentage of economically disadvantaged Statistical Profile
AAVC_Concentrator_Ct Count of specified concentration by school CTE Concentrations
AGNR_Concentrator_Ct Count of specified concentration by school CTE Concentrations
ARCH_Concentrator_Ct Count of specified concentration by school CTE Concentrations
BMA_Concentrator_Ct Count of specified concentration by school CTE Concentrations
HLTH_Concentrator_Ct Count of specified concentration by school CTE Concentrations
HOSP_Concentrator_Ct Count of specified concentration by school CTE Concentrations
INFO_Concentrator_Ct Count of specified concentration by school CTE Concentrations
MANU_Concentrator_Ct Count of specified concentration by school CTE Concentrations
MRKT_Concentrator_Ct Count of specified concentration by school CTE Concentrations
STEM_Concentrator_Ct Count of specified concentration by school CTE Concentrations
TRAN_Concentrator_Ct Count of specified concentration by school CTE Concentrations
Number_Industry_Recognized_Crede Number of Industry Recognized Credentials by school CTE Credentials
grade_range_cd_11_12 Binary field, range of grades offered Profile
grade_range_cd_11_13 Binary field, range of grades offered Profile
grade_range_cd_3_12 Binary field, range of grades offered Profile
grade_range_cd_6_12 Binary field, range of grades offered Profile
grade_range_cd_6_13 Binary field, range of grades offered Profile
grade_range_cd_7_12 Binary field, range of grades offered Profile
grade_range_cd_7_13 Binary field, range of grades offered Profile
grade_range_cd_8_12 Binary field, range of grades offered Profile
grade_range_cd_9_11 Binary field, range of grades offered Profile
grade_range_cd_9_12 Binary field, range of grades offered Profile
grade_range_cd_9_13 Binary field, range of grades offered Profile
grade_range_cd_9_9 Binary field, range of grades offered Profile
grade_range_cd_K_12 Binary field, range of grades offered Profile
grade_range_cd_PK_12 Binary field, range of grades offered Profile
grade_range_cd_PK_13 Binary field, range of grades offered Profile
calendar_type_txt_Regular_School__Year_Round_Calendar Description of school calendar and school type Profile
esea_status_P Binary field, ESEA status of P Profile
Grad_project_status_Y Binary field, Required Graduation Project Status - Yes Profile
SBE_District_Northeast Binary field, SBE District Northeast
SBE_District_Northwest Binary field, SBE District Northwest
SBE_District_Piedmont_Triad Binary field, SBE District Piedmont Triad
SBE_District_Sandhills Binary field, SBE District Sandhills
SBE_District_Southeast Binary field, SBE District Southeast
SBE_District_Southwest Binary field, SBE District Southwest
SBE_District_Western Binary field, SBE District Western
SPG_Grade_A_NG Binary field, SPG Grade A NG SPG
SPG_Grade_B Binary field, SPG Grade B SPG
SPG_Grade_C Binary field, SPG Grade C SPG
SPG_Grade_D Binary field, SPG Grade D SPG
Reading_SPG_Grade_B Binary field, SPG Reading Grade B SPG
Reading_SPG_Grade_C Binary field, SPG Reading Grade C SPG
Reading_SPG_Grade_D Binary field, SPG Reading Grade D SPG
Reading_SPG_Grade_F Binary field, SPG Reading Grade F SPG
Math_SPG_Grade_B Binary field, SPG Math Grade B SPG
Math_SPG_Grade_C Binary field, SPG Math Grade C SPG
Math_SPG_Grade_D Binary field, SPG Math Grade D SPG
Math_SPG_Grade_F Binary field, SPG Math Grade F SPG
EVAAS_Growth_Status_Met Binary field, EVAAS Growth Status Met SPG
EVAAS_Growth_Status_NotMet Binary field, EVAAS Growth Status Not Met SPG
State_Gap_Compared_Y Binary field, State Gap Compared Y SPG
Byod_Yes Binary field, Bring Your Own Device Yes Environment
grades_BYOD_11_12 Binary field, BYOD for grades 11, 12 Environment
grades_BYOD_11_12_13 Binary field, BYOD for grades 11, 12, 13 Environment
grades_BYOD_12 Binary field, BYOD for grade 12 Environment
grades_BYOD_6_7_8_9_10_11_12 Binary field, BYOD grade 6-12 Environment
grades_BYOD_6_7_8_9_10_11_12_13 Binary field, BYOD grade 6-13 Environment
grades_BYOD_8_9_10_11_12 Binary field, BYOD grade 8-12 Environment
grades_BYOD_9 Binary field, BYOD grade 9 Environment
grades_BYOD_9_10_11 Binary field, BYOD grade 9-11 Environment
grades_BYOD_9_10_11_12 Binary field, BYOD grade 9-12 Environment
grades_BYOD_9_10_11_12_13 Binary field, BYOD grade 9-13 Environment
grades_BYOD_9_11_12 Binary field, BYOD grade 9, 11. 12 Environment
grades_BYOD_PK_9_10_11_12 Binary field, BYOD grade PK, 9-12 Environment
_1_to_1_access_Yes Binary field, 1 to 1 access Yes Environment
grades_1_to_1_access_10_11_12 Binary field, 1 to 1 access grades 10-12 Environment
grades_1_to_1_access_10_11_12_13 Binary field, 1 to 1 access grades 10-13 Environment
grades_1_to_1_access_11 Binary field 1 to 1 access grade 11 Environment
grades_1_to_1_access_11_12 Binary field, 1 to 1 access grades 11-12 Environment
grades_1_to_1_access_11_12_13 Binary field, 1 to 1 access grades 11-13 Environment
grades_1_to_1_access_6_07_08 Binary field, 1 to 1 access grades 6-8 Environment
grades_1_to_1_access_6_7_8_9_10_11_12 Binary field, 1 to 1 access grades 6-12 Environment
grades_1_to_1_access_6_7_8_9_10_11_12_13 Binary field, 1 to 1 access grades 6-13 Environment
grades_1_to_1_access_9 Binary field, 1 to 1 grade 9 Environment
grades_1_to_1_access_9_10 Binary field, 1 to 1 grades 9-10 Environment
grades_1_to_1_access_9_10_11 Binary field, 1 to 1 grades 9-11 Environment
grades_1_to_1_access_9_10_11_12 Binary field, 1 to 1 grades 9-12 Environment
grades_1_to_1_access_9_10_11_12_13 Binary field, 1 to 1 grades 9-13 Environment
grades_1_to_1_access_9_11_12_13 Binary field, 1 to 1 grades 9, 11, 12, 13 Environment
SRC_devices_sent_home_Yes Binary field, SRC devies sent home yes Environment
SRC_Grades_Devices_Sent_Home_10_11_12 Binary field, School Report Card Grades Devices Sent Home Grade 10-12 Environment
SRC_Grades_Devices_Sent_Home_10_11_12_13 Binary field, School Report Card Grades Devices Sent Home grades 10-13 Environment
SRC_Grades_Devices_Sent_Home_6_07_08 Binary field, SRC Grades Devices Sent Home grades 6-8 Environment
SRC_Grades_Devices_Sent_Home_6_7_8_9_10_11_12 Binary field, SRC Grades Devices Sent Home grades 6-12 Environment
SRC_Grades_Devices_Sent_Home_6_7_8_9_10_11_12_13 Binary field, SRC Grades Devices Sent Home grades 6-13 Environment
SRC_Grades_Devices_Sent_Home_8_9_10_11_12_13 Binary field, SRC Grades Devices Sent Home grades 8-13 Environment
SRC_Grades_Devices_Sent_Home_9_10 Binary field, SRC Grades Devices Sent Home grades 9-10 Environment
SRC_Grades_Devices_Sent_Home_9_10_11 Binary field, SRC Grades Devices Sent Home grades 9-11 Environment
SRC_Grades_Devices_Sent_Home_9_10_11_12 Binary field, SRC Grades Devices Sent Home grades 9-12 Environment
SRC_Grades_Devices_Sent_Home_9_10_11_12_13 Binary field, SRC Grades Devices Sent Home grades 9-13 Environment
SRC_Grades_Devices_Sent_Home_9_10_12 Binary field, SRC Grades Devices Sent Home grades 9, 10, 12 Environment
unit_code Code to identify School/LEA/State (Primary Key) Profile
ACT_Score Average ACT Score by School/LEA/State

Missing Data

Given the importance of ACT score to our analysis, we wanted to isolate and examine the schools with an ACT value of 0. We also wanted to understand how many students were at these schools.

In [48]:
zeroScore = dfDropped[dfDropped['ACT_Score'] == 0]
zeroScore[['student_num', 'ACT_Score']]
Out[48]:
student_num ACT_Score
8 56.0 0.0
16 64.0 0.0
51 62.0 0.0
78 8.0 0.0
187 443.0 0.0
311 68.0 0.0
332 46.0 0.0
340 149.0 0.0
353 502.0 0.0
463 59.0 0.0
465 347.0 0.0

School 8 is Alexander Early College (https://www.alexander.k12.nc.us/aec). This would explain why its class size (56) is less than 10% of the average school size in the district (Alexander County Schools, 727). They also advertise that their 4-5 year program simultaneously earns its students a high school diploma along with up to two years of college credits or an Associate's Degree in Art or Science. This would explain why their small student class does not take an ACT exam.

School 16 is Avery High Viking Academy, out of the Avery County Schools district. It also has a small class size (64) compared to its district's average of 195. This school does not have a listed url so additional background information for context cannot be obtained.

School 51 is the Cabarrus Early College of Technology. Like the other schools listed it has a very small class size (62) compared to the average school size in the district (Cabarrus County Schools, 1106). A URL was not provided with the data set but a Google search yielded https://www.cabarrus.k12.nc.us/Domain/8438. Like with Alexander Early College, Cabarrus Early College of Techonlogy offers a similar program that students will simultaneously earn a high school diploma with an Associate's Degree with a focus on STEM (science, technology, engineering, and mathematics). Again, this would explain why the students do not take the ACT, because they have already obtained college credits that it is not needed.

School 78 is the Chatham School of Science & Engineering (https://www.chatham.k12.nc.us/Domain/1813) inside of the Chatham County Schools. It also provides its students the opportunity to obtain an Associate's Degree along side their High School Diploma.

School 187 is Doris Henderson Newcomers SChool (https://www.gcsnc.com/domain/757) inside of the Guilford County Schools. This school actually has a much larger class size (443) than its district's average (193). It is a school for grades 3-12 for immigrant and refugee students.

School 311 is Northampton Early College (https://northamptonec.sharpschool.com/) in the Northampton County Schools district. It looks to be another Associate's Degree program within the high school curriculum.

School 332 is Person Early College Innovation & Leadership in the Person County Schools. It also is a five year program that allows students to obtain an Associate's Degree alongside their High School Diploma.

School 340 is Early College High School (https://www.pitt.k12.nc.us/domain/890) in the Pitt County Schools district. It also allows students to obtain a Associate's Degree along with their High School Diploma.

School 353 is Richmond County 9th Grade Academy (http://www.richmond.k12.nc.us/RCNGA/) in the Richmond County Schools district.

School 463 is Wilson Academy of Applied Technology (https://waat.wilsonschoolsnc.net/) in the Wilson County Schools district. It is also a combined High School Diploma and Associate's Degree program.

School 465 is Boonville Elementary in the Yadkin County Schools district. It is not a high school and will be deleted.

Based on the analysis above, we decided to remove these instances.

In [49]:
dfDropped = dfDropped[dfDropped['ACT_Score'] != 0]

Outlier Data

Earlier exploratory analysis did not reveal outlier data, so no additional data processing was done at this stage.

Back to Top

Data Understanding 2

Focusing on Top Performing (Q4) and Bottom Performing (Q1) School Quartiles

In order to determine which factors have the largest impact on ACT scores, we are going to focus on the 1st and 4th quartiles of schools. Focusing on the two extremes will allow for a sharper contrast between those schools performing at the top and those performing near the bottom. The features identified from this analysis will allow the North Carolina public schools governing body to invest in ways to improve these metrics for schools performing in the 1st quartile.

In [50]:
qSplit = dfDropped['ACT_Score'].quantile([.25, .50, .75, 1])
dfDropped["Q25"] = np.where(dfDropped['ACT_Score'] <= qSplit[.25], 1.0, 0.0)
dfDropped["Q50"] = np.where(dfDropped['ACT_Score'] <= qSplit[.50], 1.0, 0.0)
dfDropped["Q75"] = np.where(dfDropped['ACT_Score'] <= qSplit[.75], 1.0, 0.0)
dfDropped["Q100"] = np.where(dfDropped['ACT_Score'] <= qSplit[1], 1.0, 0.0)
In [51]:
fig = plt.figure()
ax = fig.add_subplot()

plt.boxplot(dfDropped['ACT_Score'])
Out[51]:
{'whiskers': [<matplotlib.lines.Line2D at 0x1f3fceea828>,
  <matplotlib.lines.Line2D at 0x1f3fceea588>],
 'caps': [<matplotlib.lines.Line2D at 0x1f3820d5ba8>,
  <matplotlib.lines.Line2D at 0x1f3820d54a8>],
 'boxes': [<matplotlib.lines.Line2D at 0x1f3fceeacc0>],
 'medians': [<matplotlib.lines.Line2D at 0x1f3820d50b8>],
 'fliers': [<matplotlib.lines.Line2D at 0x1f3820d5400>],
 'means': []}

With more than 400 feature attributes, we needed to narrow down the data set optimize our analysis. We elected to use Recurive Feature Eliminiation to identify the most influential features impacting ACT Score.

Recursive Feature Elimination (RFE)

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature(s) until the specified number of features is reached. It is a well-known and commonly used selection technique, and we decided it was the optimal way to select the data set we ultimately wanted to work with for cluster analysis. It provided the best balance of computational efficiency and evaluative rigor. We acknowledge that it may exclude some features that perhaps should have been retained, but we felt the trade off was fair.

In RFE, features are ranked by the model’s coefficients, or feature importance values. The model recursively eliminates a small number of features per loop by attempts to eliminate dependencies and collinearity that may exist in the model. RFE requires a specified number of features to keep, however it is often not known in advance how many features are valid. To find the optimal number of features, cross-validation is used with RFE to score different feature subsets and select the best scoring collection of features.

This collection of attributes represents different aspects of our schools, including overall school metrics, test scores, teacher qualifications, demographics, and student behavior. Based on the team's knowledge acquired over the course of the semester working with the NC high school data, we believe this is an appropriate set to be working with for our analysis and explore each in more detail below. It is interesting to note that many of these attributes related to End of Course test scores that meet the College and Career Ready standards in subjects that part of the ACT test - English, math, reading, and science reasoning.

Citation: http://www.scikit-yb.org/en/latest/api/features/rfecv.html

In [52]:
cv = ShuffleSplit(n_splits=10, test_size  = 0.2, random_state = 42)
X = dfDropped.drop(columns=['ACT_Score'], axis = 1)
y = dfDropped['ACT_Score']
In [53]:
from sklearn.feature_selection import SelectPercentile, f_regression, mutual_info_regression

p = 20
selectf_reg = SelectPercentile(f_regression, percentile=p).fit(X, y)
select_mutual = SelectPercentile(mutual_info_regression, percentile=p).fit(X, y)

f_reg_selected = selectf_reg.get_support()
f_reg_selected_features = [ f for i,f in enumerate(X.columns) if f_reg_selected[i]]
print('f_Regression selected {} features.'.format(f_reg_selected.sum()))

mutual_selected = select_mutual.get_support()
mutual_selected_features = [ f for i,f in enumerate(X.columns) if mutual_selected[i]]
print('Mutual Info selected {} features.'.format(mutual_selected.sum()))

selected = f_reg_selected & mutual_selected
print('Intersection of F_Regression & Mutual Info Regression: {} features'.format(selected.sum()))
featuresFull = [ f for f,s in zip(X.columns, selected) if s]
f_Regression selected 64 features.
Mutual Info selected 64 features.
Intersection of F_Regression & Mutual Info Regression: 35 features
In [23]:
X_train, X_test, y_train, y_test = train_test_split(X[featuresFull], y, test_size=.2)

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

selector = RFECV(RandomForestRegressor(n_estimators = 10, max_depth = 3, random_state=42, n_jobs=-1), cv=cv, scoring='explained_variance', verbose = 1, n_jobs=-1)
selector.fit(X_train, y_train)

print('The optimal number of features is {}'.format(selector.n_features_))
features = [f for f,s in zip(X_train.columns, selector.support_) if s]
The optimal number of features is 12

We determined the attributes below to have the most impact on class seperation and felt the resulting attributes were much more manageable for our cluster analysis.

FEATURE DESCRIPTION
English_II_Size Average class size for English II class at school
SPG_Score School Performance Grade Score
EVAAS_Growth_Score Education Value-Added Assessment System Growth Score
NC_Math_1_Score Average score by school for NC Math 1
Passing_NC_Math_3 Average score by school for Passing NC Math 3
EOCSubjects_CACR_All Average score by school for all students taking End of Course Subjects tests based on College and Career Ready standards
EOCEnglish2_CACR_Male Average score by school for males taking End of Course English II tests based on College and Career Ready standards
EOCEnglish2_CACR_White Average score by school for white students taking End of Course English II tests based on College and Career Ready standards
EOCMathI_CACR_White Average score by school for white students taking End of Course Math I tests based on College and Career Ready standards
EOCSubjects_CACR_White Average score by school of white students taking End of Course Subjects tests based on College and Career Ready standards
EOCBiology_CACR_SWD Average score by school of students with disabilities taking End of Course Biology tests based on College and Career Ready standards
st_short_susp_per_c_num Number of Short term suspensions per 100 students at the State level
nbpts_num Number of National Board Certified Staff at school level
pct_eds Percentage of economically disadvantaged students
In [54]:
featCols = ['English_II_Size', 'SPG_Score', 'EVAAS_Growth_Score', 'NC_Math_1_Score', 
            'Passing_NC_Math_3', 'EOCSubjects_CACR_All', 'EOCEnglish2_CACR_Male',
            'EOCEnglish2_CACR_White', 'EOCMathI_CACR_White', 'EOCSubjects_CACR_White',
            'EOCBiology_CACR_SWD', 'st_short_susp_per_c_num', 'nbpts_num', 'pct_eds']

for feat in featCols:
    fig = plt.figure()
    fig.suptitle(feat)

    ax = fig.add_subplot(131)
    plt.boxplot(dfDropped[feat], showmeans=True)
    ax.set_xlabel('All Schools')

    ax = fig.add_subplot(132, sharey=ax)
    btmQ = dfDropped[dfDropped['Q25'] == 1]
    plt.boxplot(btmQ[feat], showmeans=True)
    ax.set_xlabel('Bottom Quartile Schools')

    ax = fig.add_subplot(133, sharey=ax)
    topQ = dfDropped[dfDropped['Q100'] == 1]
    plt.boxplot(topQ[feat], showmeans=True)
    ax.set_xlabel('Top Quartile Schools')

English_II_Size The mean for all schools is 19.1. The mean for all schools in the bottom quartile is smaller since it's around 17. The box plot for all schools in the bottom quartile is lower than the box plot for all schools and top quartile schools as expected.

SPG_Score The mean School Performance Grade for all schools is 72.7. The mean for all schools in the bottom quartile is much lower than the mean for all schools with a score of 60. Surprisingly the mean for all schools in the top quartile is around the same as the mean across all schools. The box plot for all schools looks almost identical to the box plot for the top quartile.

EVAAS_Growth_Score The mean and median EVAAS Growth Score is lower for the bottom quartile schools than the top quartile schools. This shows that schools in the bottom quartile based on their ACT score has on average lower EVAAS Growth Scores than other schools.

NC_Math_1_Score The mean NC Math 1 score and max value are much lower (20 points) for the bottom quartile schools than all schools or top quartile schools. The direct correlation betwen the lower score and the bottom quartile is expected since the quartiles are based on ACT score and Math is tested on the ACT.

Passing_NC_Math_3 The majority of schools have scores close to a 100 for this variable. There are outlier schools that have scores of 0 that aren't depicted in the box plot since there are so few.

EOCSubjects_CACR_All The max score and the mean and median for the all End of Course subjects that met the career and college readiness standard score for the bottom quartile schools is lower than all schools. The box plot for all schools is identical to the box plot for the top quartile.

EOCEnglish2_CACR_Male The mean English 2 score for male students evaluated on meeting the career and college readines standard for the bottom quartile is much lower than the mean for all schools. Since Reading and English are both sections on the ACT exam, it makes sense that a bottom quartile school is much lower than all schools. The EOC English 2 score could be correlated with the ACT score.

EOCEnglish2_CACR_White The mean English 2 score for students who are ethnically white is higher than the mean English 2 score for all males. The mean for the bottom quartile schools is lower than the mean for all schools. The mean, median, Q2, and Q4 values on the box plot for white students are roughly 10 points higher than all male students.

EOCMath1_CACR_White The mean Math 1 score for white students in bottom quartile schools is much lower by 20 points than male students in all schools. The box plot for all schools and top quartile schools look identical to each other. There is no discernable difference between the two plots. It makes sense that the bottom quartile schools have a lower mean score for the Math 1 End of Course exam because math is a section on the ACT exam and the bottom quartile schools is determined by ACT score.

EOCSubjects_CACR_White The mean for all End of Course subject test scores for white students in bottom quartile schools is lower than the mean for all schools and top quartile schools. It is interesting to note that the IQR for bottom quartile schools is smaller indicating a smaller spread than the other box plots.

EOCBiology_CACR_SWD The mean for EOC Biology course subject test scores for students with disabilties is even across all the box plots. This indicates that the quartile based on ACT score has no correlation to students with disabilities. The max quartile value is close to 40 for all schools and quartile schools. For bottom quartile schools, the max quartile value is 30. The minimum quartile value is zero. This could indicate that students with disabilities may not be required to take the exam hence the score of zero.

st_short_susp_per_c_num The boxplot is identical across all schools whih indicate that there is no difference in number of short term suspensions per 100 students in relation to bottom or top quartiles based on ACT score. This indicates that there is zero correlation between ACT score and short term suspensions.

nbpts_num The mean number of National Board Certified Staff at the school at bottom quartile schools is smaller than the mean at all schools and top quartile schools. This could indicate that there are fewer National Board Certified staff members at bottom quartile schools than other schools.

pct_eds The mean percentage of economically disadvantaged students are higher at bottom quartile schools than all schools and top quartile schools. This implies that economically disadvantaged students in the NC public school system are more likely to score lower on the ACT than students who aren't economically disadvantaged. The mean percentage of economically disadvantaged students is less than 50% at top quartile schools.

In [55]:
breaks = np.asarray(np.percentile(dfDropped['ACT_Score'], [25,50,75,100]))
dfDropped['ACT_Score_Quartiles'] = (dfDropped['ACT_Score'].values > breaks[..., np.newaxis]).sum(0)
In [56]:
quartile = {0: "First Q", 1: "Second Q", 2: "Third Q", 3: "Fourth Q"}

# scatter plot code from: https://stackoverflow.com/questions/21654635/scatter-plots-in-pandas-pyplot-how-to-plot-by-category
groups = dfDropped.groupby('ACT_Score_Quartiles')

for feat in featCols:
    fig, ax = plt.subplots()
    ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
    for name, group in groups:
        ax.plot(getattr(group, feat), group.ACT_Score, marker='o', 
                    linestyle='', ms=12, label=quartile[name])
    ax.legend()
    ax.set(title='ACT Score vs ' + feat, 
               xlabel=feat,
               ylabel='ACT Score')

    plt.show()
    plt.tight_layout()

Note: The top quartile schools are referred to as Fourth Q - Magenta and the bottom quartile schools are referred to as First Q - Blue in the scatter plots above.

ACT Score vs English II Size There seems to be no correlation between ACT Score for the different quartiles.

ACT Score vs SPG_Score The ACT Score is positively correlated with the school performance grade score. As the ACT score increases, the school performance grade score increases as well. The top quartile schools have a denser concentration of higher SPG scores than the bottom quartile (First Q - blue)

ACT Score vs EVAAS_Growth_Score There seems to be no correlation between ACT Score vs. EVAAS_Growth_Score for the different quartiles.

ACT Score vs. NC_Math_1_Score The ACT Score is positively correlated with the NC_Math_1 score. As the ACT score increaes, the NC_Math_1_Scores also increases. The top quartile schools have a more dense NC_Math_1 score than bottom quartile schools. There are also more schools in the top quartile with 100 NC_Math_1_Scores than the other quartiles.

ACT Score vs. Passing_NC_Math_3 There seems to be no correlation between ACT Score vs. Passing_NC_Math_3.

ACT Score vs. EOCSubjects_CACR_All The ACT Score is positively correlated with EOCSubjects_CACR_All. As the ACT score increases, the EOCSubjects_CACR_All score also increases.

ACT Score vs. EOCEnglish2_CACR_Male The ACT Score is positively correlated with EOCEnglish2_CACR_Male. As the ACT score increases, the EOCEnglish2_CACR_Male score also increases.

ACT Score vs EOCEnglish2_CACR_White The ACT Score is positively correlated with EOCEnglish2_CACR_White. As the ACT score increases, the EOCEnglish2_CACR_White score also increases. The correlation for this scatter plot appears to be more spread out than the plot for the males above.

ACT Score vs EOCMathI_CACR_White The ACT Score is positively correlated with EOCMathI_CACR_White. As the ACT score increases, the EOCMathI_CACR_White score also increaes. The correlation for this scatter plot seems to be more spread out than the above plot. This indicates it's a weaker positive correlation.

ACT Score vs EOCSubjects_CACR_White The ACT Score is positively correlated with EOCSubjects_CACR_White. As the ACT score increases, the EOCSubjects_CACR_White score also increases. The correlation is stronger since it is less spread out.

ACT Score vs EOCBiology_CACR_SWD There appears to be no correlation between ACT Score and EOCBiology_CACR_SWD (students with disabilities)

ACT Score vs st_short_susp_per_c_num There appears to be no correlation between ACT Score and st_short_susp_per_c_num.

ACT Score vs nbpts_num There appears to be a weak positive correlation between ACT Score and nbpts_num. As the nbpts_num (number of Nationally Board Certified staff) increases the ACT Score increases.

ACT Score vs pct_eds There appears to be a negative correlation between ACT Score and pct_eds (percentage of economically disadvantaged students). As the ACT score decreases, the percentage of economically disadvantaged students increases.

Data Summary

We will be splitting the data into two sets. One set will include all public schools and the other set will include only those schools that are in the top or bottom quartile for ACT scores. This will allow us to visualize the sharper contrast between schools in the top and bottom quartile. If we left the middle quartiles in the data, our models would try to generalize to accomodate schools in the middle which will cause our models to perform less accurately. With the sharp contrast, the models won't have to generalize to accomodate data that isn't as clearly defined.

Back to Top

Modeling and Evaluation

For this analysis, multiple techniques were used to ultimately identify school features that will help them ultimately improve upon their performance. First, as noted above, we reduced the dimensionality of the data so that we could work with a much more manageable number of features without sacrificing performance. Now, we will apply four separate clustering algorithms to determine the best number of classifications that would improve our overall models. We will tune each parameter in Random Forest models to achieve the optimal classification. Finally, we applied the Random Forest algorithm to varying datasets to assess its performance and minimize any misclassifications.

The main clustering models we will use are:

K Means

K-Means is a clustering technique that attempts to find a find a user-defined number of clusters, K. The methodology used to create clusters is prototype-based and partitional, where the specific object being clustered is more similar (also closer) to the object that defines the cluster and where objects are simply divided into non-overlapping subsets, or clusters.

A more technical way of defining K-Means is that the algorithm is trying to separate objects into n groups of equal variances, while minimizing the within-cluster sum-of-squares. It is an iterative process, where the results is usually a local, optimal value but not necessarily a global optimum. The distance measurements available include Euclidean, Manhattan, and cosine; the centroid measurements are commonly median or mean.

Mini-Batch K-Means

Mini-Batch K-Means clustering is a variant of K-Means but creates subsets of data that have been randomly sampled during each training iteration. This process reduces the time needed for computation while still optimizing the same K-Means function. It tends to converge faster, but the results quality in practice may be slightly reduced than in the full K-Means algorithm.

DBSCAN

DBSCAN is a density-based algorithm that clusters dense regions of objects (core points and border points) surrounded by low density regions. The number of clusters is determined by the algorithm automatically. The lower-density regions are considered noise points and omitted from the results, which produces partial clusters that do not contain all data points. Overall, DBSCAN clusters can be any shape. Important DBSCAN definitions and parameters include:

  • Core points: the interior points of a density-based cluster, where a point is defined as “core” if the number of points within a given neighborhood is within a specified distance
  • Border points: a point that falls within the neighborhood of a core point
  • Noise points: a point that is not a core point or border point
  • Eps: a user-defined distance parameter, maximum distance between two samples for them to be considered as in the same neighborhood
  • MinPts: a user-defined parameter threshold, number of samples in a neighborhood for a point to be considered as a core point, including the point itself

Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering is a collection of similarly related techniques that produce a hierarchical cluster (a set of nested clusters that are organized as a tree). The algorithm starts with each data point as a single cluster and then repeatedly merges the two closest until a single, all-encompassing cluster remains.

Summary/Snapshot of Clustering Methods

Method name Parameters Scalability Best Use Case Geometry/Measurement Metric
K-Means number of clusters Very large n_samples, medium n_clusters with MiniBatch code General-purpose, even cluster size, flat geometry, not too many clusters Distances between points
DBSCAN neighborhood size Very large n_samples, medium n_clusters Non-flat geometry, uneven cluster sizes Distances between nearest points
Agglomerative clustering number of clusters, linkage type, distance Large n_samples and n_clusters Many clusters, possibly connectivity constraints, non Euclidean distances Any pairwise distance
Ward hierarchical clustering number of clusters Large n_samples and n_clusters Many clusters, possibly connectivity constraints Distances between points

Back to Top

TRAIN AND ADJUST PARAMETERS (10 points)

Model # Cluster Algorithm Data Instances
Cluster 1 K-Means All Quartiles
Cluster 2 MiniBatchK-Means All Quartiles
Cluster 3 DBSCAN All Quartiles
Cluster 4 Agglomerative All Quartiles
Cluster 5 K-Means Top & Bottom Quartiles Only
Cluster 6 MiniBatchK-Means Top & Bottom Quartiles Only
Cluster 7 DBSCAN Top & Bottom Quartiles Only
Cluster 8 Agglomerative Top & Bottom Quartiles Only

Cluster Tuning Parameters

Parameters Description Range
Number of Clusters How many clusters to form 2 - 10
Minimum Samples Minimum number of instances to become a cluster 10 - 20
Epsilon Maximum distance from cluster for instance to stay 15 - 25
Link Distance to use for instances to stay in cluster Ward, Complete, Average

All Quartiles

In [57]:
# create cross validation variable
cv = ShuffleSplit(n_splits=10, test_size=.2, random_state=42)

# training data
X = dfDropped[featCols]

# target variable
y = dfDropped['ACT_Score_Quartiles']

# Classifier
cls = RandomForestClassifier(random_state = 42)

# Results Dictionary
bestResults = {}

Cluster 1

In [58]:
bestAcc = 0
bestSil = -2

for kSize in np.arange(2, 10):
    # Instantiate the clustering model
    model = KMeans(n_clusters = kSize, random_state = 42)
    model.fit(X)
    cLabels = model.labels_
    
    X = np.column_stack((X, pd.get_dummies(cLabels)))
    result = scoreRF(X, y, cLabels)
      
    if result['mean'] > bestAcc:
        bestAcc = result['mean']
        bestRFkSize = kSize
        bestRFLabels = cLabels
        
    if result['sil'] > bestSil:
        bestSil = result['sil']
        bestSilkSize = kSize
        bestSilLabels = cLabels
        
bestResults['kmeans'] = bestAcc

print("Best Silhouette Score: \n\t K Size: {} \n\t Silhouette Score: {}".format(bestSilkSize, bestSil))
print("Best Accuracy Score: \n\t K Size: {} \n\t Result: {}".format(bestRFkSize, bestAcc))
Best Silhouette Score: 
	 K Size: 2 
	 Silhouette Score: 0.3644676725879654
Best Accuracy Score: 
	 K Size: 2 
	 Result: 0.6956521739130435

Cluster 2

In [32]:
bestAcc = 0
bestSil = -2

for kSize in np.arange(2, 10):
    # Instantiate the clustering model
    model = MiniBatchKMeans(n_clusters = kSize, random_state = 42)
    model.fit(X)
    cLabels = model.labels_
    
    X = np.column_stack((X, pd.get_dummies(cLabels)))
    result = scoreRF(X, y, cLabels)

    if result['mean'] > bestAcc:
        bestAcc = result['mean']
        bestRFkSize = kSize
        bestRFLabels = cLabels
        
    if result['sil'] > bestSil:
        bestSil = result['sil']
        bestSilkSize = kSize
        bestSilLabels = cLabels
        
bestResults['batchkmeans'] = bestAcc

print("Best Silhouette Score: \n\t K Size: {} \n\t Silhouette Score: {}".format(bestSilkSize, bestSil))
print("Best Accuracy Score: \n\t K Size: {} \n\t Accuracy: {}".format(bestRFkSize, bestAcc))
Best Silhouette Score: 
	 K Size: 2 
	 Silhouette Score: 0.36481868690643277
Best Accuracy Score: 
	 K Size: 9 
	 Accuracy: 0.6815217391304348

Cluster 3

In [33]:
bestAcc = 0
bestSil = -2

for minS in np.arange(10, 20):
    for eps in np.arange(15, 25):
        # Instantiate the clustering model
        model = DBSCAN(eps = eps, min_samples=minS, n_jobs=-1)
        model.fit(X)
        cLabels = model.labels_

        X = np.column_stack((X, pd.get_dummies(cLabels)))
        result = scoreRF(X, y, cLabels)

        #print("Eps: {} \t Min Samples: {} \t Result: {}".format(eps, minS, result['mean']))
        
    if result['mean'] > bestAcc:
        bestAcc = result['mean']
        bestRFMinS = minS
        bestRFEps = eps
        bestRFLabels = cLabels
        
    if result['sil'] > bestSil:
        bestSil = result['sil']
        bestSilMinS = minS
        bestSilEps = eps
        bestSilLabels = cLabels
        
bestResults['dbscan'] = bestAcc

print("Best Silhouette Score: \n\t Min Samples: {} \n\t Eps: {} \n\t Silhouette Score: {}".format(bestSilMinS, bestSilEps, bestSil))
print("Best Accuracy Score: \n\t Min Samples: {} \n\t Eps: {} \n\t Accuracy: {}".format(bestRFMinS, bestRFEps, bestAcc))
Best Silhouette Score: 
	 Min Samples: 10 
	 Eps: 24 
	 Silhouette Score: 0.22018844966327106
Best Accuracy Score: 
	 Min Samples: 10 
	 Eps: 24 
	 Accuracy: 0.658695652173913

Cluster 4

In [34]:
bestAcc = 0
bestSil = -2

for kSize in np.arange(2, 10):
    for link in ['ward', 'complete', 'average']:
        # Instantiate the clustering model
        model = AgglomerativeClustering(n_clusters = kSize, linkage = link)
        model.fit(X)
        cLabels = model.labels_

        X = np.column_stack((X, pd.get_dummies(cLabels)))
        result = scoreRF(X, y, cLabels)

        #print("Eps: {} \t Min Samples: {} \t Result: {}".format(eps, minS, result['mean']))
        
        if result['mean'] > bestAcc:
            bestAcc = result['mean']
            bestRFkSize = kSize
            bestRFLink = link
            bestRFLabels = cLabels
        
        if result['sil'] > bestSil:
            bestSil = result['sil']
            bestSilkSize = kSize
            bestSilLink = link
            bestSilLabels = cLabels
        
bestResults['agg'] = bestAcc

print("Best Silhouette Score: \n\t K Size: {} \n\t Link: {} \n\t Silhouette Score: {}".format(bestSilkSize, bestSilLink, bestSil))
print("Best Accuracy Score: \n\t K Size: {} \n\t Link: {} \n\t Accuracy: {}".format(bestRFkSize, bestRFLink, bestAcc))
Best Silhouette Score: 
	 K Size: 2 
	 Link: average 
	 Silhouette Score: 0.5002561413478372
Best Accuracy Score: 
	 K Size: 9 
	 Link: ward 
	 Accuracy: 0.6554347826086957

Top & Bottom Quartiles Only

In [59]:
dfExtremes = dfDropped[(dfDropped['ACT_Score_Quartiles'] == 0) | (dfDropped['ACT_Score_Quartiles'] == 3)]

X = dfExtremes[featCols]

y = dfExtremes['ACT_Score_Quartiles']

Cluster 5

In [36]:
bestAcc = 0
bestSil = -2

for kSize in np.arange(2, 10):
    # Instantiate the clustering model
    model = KMeans(n_clusters = kSize, random_state = 42)
    model.fit(X)
    cLabels = model.labels_
    
    X = np.column_stack((X, pd.get_dummies(cLabels)))
    result = scoreExtremeRF(X, y, cLabels)
      
    if result['mean'] > bestAcc:
        bestAcc = result['mean']
        bestRFkSize = kSize
        bestRFLabels = cLabels
        
    if result['sil'] > bestSil:
        bestSil = result['sil']
        bestSilkSize = kSize
        bestSilLabels = cLabels
        
bestResults['extreme_kmeans'] = bestAcc

print("Best Silhouette Score: \n\t K Size: {} \n\t Silhouette Score: {}".format(bestSilkSize, bestSil))
print("Best Accuracy Score: \n\t K Size: {} \n\t Result: {}".format(bestRFkSize, bestAcc))
Best Silhouette Score: 
	 K Size: 2 
	 Silhouette Score: 0.5017249673944034
Best Accuracy Score: 
	 K Size: 2 
	 Result: 0.9851063829787234

Cluster 6

In [44]:
bestAcc = 0
bestSil = -2

for kSize in np.arange(2, 10):
    # Instantiate the clustering model
    model = MiniBatchKMeans(n_clusters = kSize, random_state = 42)
    model.fit(X)
    cLabels = model.labels_
    
    X = np.column_stack((X, pd.get_dummies(cLabels)))
    result = scoreExtremeRF(X, y, cLabels)

    if result['mean'] > bestAcc:
        bestAcc = result['mean']
        bestRFkSize = kSize
        bestRFLabels = cLabels
        
    if result['sil'] > bestSil:
        bestSil = result['sil']
        bestSilkSize = kSize
        bestSilLabels = cLabels
        
bestResults['extreme_batchkmeans'] = bestAcc

print("Best Silhouette Score: \n\t K Size: {} \n\t Silhouette Score: {}".format(bestSilkSize, bestSil))
print("Best Accuracy Score: \n\t K Size: {} \n\t Accuracy: {}".format(bestRFkSize, bestAcc))
Best Silhouette Score: 
	 K Size: 2 
	 Silhouette Score: 0.5014952813546882
Best Accuracy Score: 
	 K Size: 3 
	 Accuracy: 0.9872340425531915

Cluster 7

In [38]:
bestAcc = 0
bestSil = -2

for minS in np.arange(10, 20):
    for eps in np.arange(15, 25):
        # Instantiate the clustering model
        model = DBSCAN(eps = eps, min_samples=minS, n_jobs=-1)
        model.fit(X)
        cLabels = model.labels_

        X = np.column_stack((X, pd.get_dummies(cLabels)))
        result = scoreExtremeRF(X, y, cLabels)

        #print("Eps: {} \t Min Samples: {} \t Result: {}".format(eps, minS, result['mean']))
        
    if result['mean'] > bestAcc:
        bestAcc = result['mean']
        bestRFMinS = minS
        bestRFEps = eps
        bestRFLabels = cLabels
        
    if result['sil'] > bestSil:
        bestSil = result['sil']
        bestSilMinS = minS
        bestSilEps = eps
        bestSilLabels = cLabels
        
bestResults['extreme_dbscan'] = bestAcc

print("Best Silhouette Score: \n\t Min Samples: {} \n\t Eps: {} \n\t Silhouette Score: {}".format(bestSilMinS, bestSilEps, bestSil))
print("Best Accuracy Score: \n\t Min Samples: {} \n\t Eps: {} \n\t Accuracy: {}".format(bestRFMinS, bestRFEps, bestAcc))
Best Silhouette Score: 
	 Min Samples: 18 
	 Eps: 24 
	 Silhouette Score: 0.0814370539597228
Best Accuracy Score: 
	 Min Samples: 10 
	 Eps: 24 
	 Accuracy: 0.9829787234042554

Cluster 8

In [46]:
bestAcc = 0
bestSil = -2

for kSize in np.arange(2, 10):
    for link in ['ward', 'complete', 'average']:
        # Instantiate the clustering model
        model = AgglomerativeClustering(n_clusters = kSize, linkage = link)
        model.fit(X)
        cLabels = model.labels_

        X = np.column_stack((X, pd.get_dummies(cLabels)))
        result = scoreExtremeRF(X, y, cLabels)

        #print("Eps: {} \t Min Samples: {} \t Result: {}".format(eps, minS, result['mean']))
        
        if result['mean'] > bestAcc:
            bestAcc = result['mean']
            bestRFkSize = kSize
            bestRFLink = link
            bestRFLabels = cLabels
        
        if result['sil'] > bestSil:
            bestSil = result['sil']
            bestSilkSize = kSize
            bestSilLink = link
            bestSilLabels = cLabels
        
bestResults['extreme_agg'] = bestAcc

print("Best Silhouette Score: \n\t K Size: {} \n\t Link: {} \n\t Silhouette Score: {}".format(bestSilkSize, bestSilLink, bestSil))
print("Best Accuracy Score: \n\t K Size: {} \n\t Link: {} \n\t Accuracy: {}".format(bestRFkSize, bestRFLink, bestAcc))
Best Silhouette Score: 
	 K Size: 2 
	 Link: ward 
	 Silhouette Score: 0.4954828095343213
Best Accuracy Score: 
	 K Size: 2 
	 Link: ward 
	 Accuracy: 0.9872340425531915
In [47]:
pprint(bestResults)
print("Best clustering method: {}".format(max(bestResults, key=bestResults.get)))
{'agg': 0.6554347826086957,
 'batchkmeans': 0.6815217391304348,
 'dbscan': 0.658695652173913,
 'extreme_agg': 0.9872340425531915,
 'extreme_batchkmeans': 0.9872340425531915,
 'extreme_dbscan': 0.9829787234042554,
 'extreme_kmeans': 0.9851063829787234,
 'kmeans': 0.6956521739130435}
Best clustering method: extreme_batchkmeans
In [48]:
print("Based on the accuracy of each model '{}' is the best clustering method for this data.".format(max(bestResults, key=bestResults.get)))
Based on the accuracy of each model 'extreme_batchkmeans' is the best clustering method for this data.
In [60]:
model = MiniBatchKMeans(n_clusters = 2, random_state = 42)
model.fit(dfExtremes[featCols])
cLabels = model.labels_
In [50]:
# Data
X_Dropped = dfDropped[featCols]

# Target
y_Dropped = dfDropped['ACT_Score_Quartiles']

# Training and testing data
X_train_Dropped, X_test_Dropped, y_train_Dropped, y_test_Dropped = train_test_split(X_Dropped, y_Dropped, random_state=42, test_size=.2)

# Data
X_extreme = dfExtremes[featCols]

# target
y_extreme = dfExtremes['ACT_Score_Quartiles']

# training and testing data
X_train_extreme, X_test_extreme, y_train_extreme, y_test_extreme = train_test_split(X_extreme, y_extreme, random_state=42, test_size=.2)

# Add clustering labels into data
dfCluster = np.column_stack((X_extreme, pd.get_dummies(cLabels)))

X_cluster = dfCluster

y_cluster = dfExtremes['ACT_Score_Quartiles']

X_train_cluster, X_test_cluster, y_train_cluster, y_test_cluster = train_test_split(X_cluster, y_cluster, random_state=42, test_size=.2)

RandomForest Tuning Parameters

Parameters Description Range
Max Depth Max levels down from root 40 - 140
Min Sample Split Min samples before a split of samples should occur 2 - 10
Min Sample Leaf Min samples allowed to be considered a leaf 1 - 10
Estimators Number of trees to generate 200 - 2000

RF Model Parameter Tuning

For each model, four features are tuned to determine the best results. The blue line identifies how well the model performs on the training data. A score closer to 1 is best. The green curve shows how the model performs on the test data. Again a score closer to 1 is best. By looking at these curves we can identify the optimum value for the tuning parameter and when the model starts to overfit the data. An example of overfitting would be where the blue curve starts to track upward and the green curve starts to track downward.

RF Model 1

Estimators

In [38]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'n_estimators',
    param_range = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
    cv = cv, scoring = 'accuracy', n_jobs = -1
)

viz.fit(X_train_Dropped, y_train_Dropped)
viz.poof()

Max Depth

In [220]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'max_depth',
    param_range = [int(x) for x in np.linspace(40, 140, num = 11)], 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_Dropped, y_train_Dropped)
viz.poof()

Minimum Sample Split

In [39]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'min_samples_split',
    param_range = np.arange(2, 10), 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_Dropped, y_train_Dropped)
viz.poof()

Minimum Samples per Leaf

In [224]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'min_samples_leaf',
    param_range = np.arange(1, 10), 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_Dropped, y_train_Dropped)
viz.poof()

RF Model 2

Estimators

In [45]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'n_estimators',
    param_range = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
    cv = cv, scoring = 'accuracy', n_jobs = -1
)

viz.fit(X_train_extreme, y_train_extreme)
viz.poof()

Max Depth

In [46]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'max_depth',
    param_range = [int(x) for x in np.linspace(40, 140, num = 11)], 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_extreme, y_train_extreme)
viz.poof()

Mininum Simples for Split

In [47]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'min_samples_split',
    param_range = np.arange(2, 10), 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_extreme, y_train_extreme)
viz.poof()

Minumum Samples per Leaf

In [51]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'min_samples_leaf',
    param_range = np.arange(1, 10), 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_extreme, y_train_extreme)
viz.poof()

RF Model 3

Estimators

In [52]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'n_estimators',
    param_range = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
    cv = cv, scoring = 'accuracy', n_jobs = -1
)

viz.fit(X_train_cluster, y_train_cluster)
viz.poof()

Max Depth

In [53]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'max_depth',
    param_range = [int(x) for x in np.linspace(40, 140, num = 11)], 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_cluster, y_train_cluster)
viz.poof()

Minimum Samples for Split

In [54]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'min_samples_split',
    param_range = np.arange(2, 10), 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_cluster, y_train_cluster)
viz.poof()

Minimum Samples per Leaf

In [55]:
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'min_samples_leaf',
    param_range = np.arange(1, 10), 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_cluster, y_train_cluster)
viz.poof()

Back to Top

EVALUATE AND COMPARE

Clustering Results on Feature Columns

Model # Cluster Algorithm Data Instances Cluster Size Silhouette Score Accuracy Score
Cluster 1 K-Means All Quartiles 2 0.364 0.696
Cluster 2 MiniBatchK-Means All Quartiles 2 0.365 0.682
Cluster 3 DBSCAN All Quartiles N/A 0.220 0.659
Cluster 4 Agglomerative All Quartiles 2 0.500 0.655

Clustering on Top and Bottom Quartiles of Schools, based on ACT Scores

Model # Cluster Algorithm Data Instances Kernel Size Silhouette Score Accuracy Score
Cluster 5 K-Means Top & Bottom Quartiles Only 2 0.502 0.985
Cluster 6 MiniBatchK-Means Top & Bottom Quartiles Only 2 0.501 0.987
Cluster 7 DBSCAN Top & Bottom Quartiles Only N/A 0.081 0.983
Cluster 8 Agglomerative Top & Bottom Quartiles Only 2 0.497 0.987

Random Forest Tuning

Model Data Instances Estimator Depth Min Samples Min Samples Leaf
RF Model 1 All Quartiles 200 N/A 8 9
RF Model 2 Top & Bottom Quartiles N/A N/A N/A 3
RF Model 3 Top, Bottom Quartiles, & Cluster Labels 200 N/A N/A N/A

RF Model Comparison Results

Model Data Instances Accuracy Precision Recall f1 Score
RF Model 1 All Quartiles 0.732 0.733 0.725 0.725
RF Model 2 Top & Bottom Quartiles 1 1 1 1
RF Model 3 Top, Bottom Quartiles, & Cluster Labels 1 1 1 1

RF Model Evaluation

In [54]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
max_features = ['auto', 'log2', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(70, 140, num = 8)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 4, 6, 12]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

RF Model 1

In [55]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rfc = RandomForestClassifier(random_state = 24)

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rfc_randomCV = RandomizedSearchCV(estimator = rfc, 
                                  param_distributions = random_grid, 
                                  n_iter = 100, cv = cv, verbose = 2, 
                                  random_state = 18, n_jobs = -1, 
                                  scoring = 'accuracy')

# Fit the random search model
rfc_randomCV.fit(X_train_Dropped, y_train_Dropped)
Fitting 10 folds for each of 100 candidates, totalling 1000 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:   12.7s
[Parallel(n_jobs=-1)]: Done 301 tasks      | elapsed:   36.4s
[Parallel(n_jobs=-1)]: Done 584 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  1.8min finished
Out[55]:
RandomizedSearchCV(cv=ShuffleSplit(n_splits=10, random_state=42, test_size=0.2, train_size=None),
          error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'log2', 'sqrt'], 'max_depth': [70, 80, 90, 100, 110, 120, 130, 140, None], 'min_samples_split': [2, 4, 6, 12], 'min_samples_leaf': [1, 2, 3], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=18, refit=True,
          return_train_score='warn', scoring='accuracy', verbose=2)
In [58]:
# examine the best model
print(rfc_randomCV.best_score_)
print(rfc_randomCV.best_params_)
print(rfc_randomCV.best_estimator_)
0.7324324324324324
{'n_estimators': 1200, 'min_samples_split': 12, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 140, 'bootstrap': True}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=140, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=12,
            min_weight_fraction_leaf=0.0, n_estimators=1200, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)
In [59]:
# get best parameters from the gridsearch for the all quartile model
bestValues = rfc_randomCV.best_params_

print("Best parameters set found on development set: {}".format(bestValues))

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Set model to best values found
cls = RandomForestClassifier(bootstrap = bestValues['bootstrap'], 
                             max_depth = bestValues['max_depth'], 
                             max_features = bestValues['max_features'], 
                             min_samples_leaf = bestValues['min_samples_leaf'], 
                             min_samples_split = bestValues['min_samples_split'], 
                             n_estimators = bestValues['n_estimators'])

classFit = cls.fit(X_train_Dropped, y_train_Dropped)

y_hat = cls.predict(X_test_Dropped)

# Train
cm = ConfusionMatrix(classFit)

# Predict test values
cm.predict(X_test_Dropped)

cm.score(X_test_Dropped, y_test_Dropped)

cm.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Classification Report
vis = ClassificationReport(cls)
vis.fit(X_train_Dropped, y_train_Dropped)
vis.score(X_test_Dropped, y_test_Dropped)
vis.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Prediction Error Report
vis = ClassPredictionError(cls)

# Fit the training data to the visualizer
vis.fit(X_train_Dropped, y_train_Dropped)

# Evaluate the model on the test data
vis.score(X_test_Dropped, y_test_Dropped)

# Draw visualization
vis.poof()
Best parameters set found on development set: {'n_estimators': 1200, 'min_samples_split': 12, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 140, 'bootstrap': True}
In [60]:
#feature importance of all quartile classification model

clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=80, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)

clf.fit(X_Dropped, y_Dropped)

feats = {} # a dict to hold feature_name: feature_importance
for feature, importance in zip(dfDropped.columns, clf.feature_importances_):
    feats[feature] = importance #add the name/value pair 

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='barh', legend = False,
                                                  title = 'Gini Score')
plt.tight_layout()

RF Model 2

In [61]:
X_extreme = dfExtremes[featCols]
y_extreme = dfExtremes['ACT_Score_Quartiles']

X_train_extreme, X_test_extreme, y_train_extreme, y_test_extreme = train_test_split(X_extreme, y_extreme, random_state=42, test_size=.2)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rfc = RandomForestClassifier(random_state = 24)

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rfc_randomCV2 = RandomizedSearchCV(estimator = rfc, 
                                  param_distributions = random_grid, 
                                  n_iter = 100, cv = cv, verbose = 2, 
                                  random_state = 18, n_jobs = -1, 
                                  scoring = 'accuracy')

# Fit the random search model
rfc_randomCV2.fit(X_train_extreme, y_train_extreme)

# examine the best model
print(rfc_randomCV2.best_score_)
print(rfc_randomCV2.best_params_)
print(rfc_randomCV2.best_estimator_)
Fitting 10 folds for each of 100 candidates, totalling 1000 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:    8.1s
[Parallel(n_jobs=-1)]: Done 301 tasks      | elapsed:   26.4s
[Parallel(n_jobs=-1)]: Done 584 tasks      | elapsed:   50.9s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  1.4min finished
1.0
{'n_estimators': 2000, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 90, 'bootstrap': True}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=90, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=2000, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)
In [62]:
# examine the best model
print(rfc_randomCV2.best_score_)
print(rfc_randomCV2.best_params_)
print(rfc_randomCV2.best_estimator_)
1.0
{'n_estimators': 2000, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 90, 'bootstrap': True}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=90, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=2000, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)
In [63]:
# get best parameters from the gridsearch for the all quartile model
bestValues = rfc_randomCV2.best_params_

print("Best parameters set found on development set: {}".format(bestValues))

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Set model to best values found
cls = RandomForestClassifier(bootstrap = bestValues['bootstrap'], 
                             max_depth = bestValues['max_depth'], 
                             max_features = bestValues['max_features'], 
                             min_samples_leaf = bestValues['min_samples_leaf'], 
                             min_samples_split = bestValues['min_samples_split'], 
                             n_estimators = bestValues['n_estimators'])

classFit = cls.fit(X_train_extreme, y_train_extreme)

y_hat = cls.predict(X_test_extreme)

# Train
cm = ConfusionMatrix(classFit)

# Predict test values
cm.predict(X_test_extreme)

cm.score(X_test_extreme, y_test_extreme)

cm.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Classification Report
vis = ClassificationReport(cls)
vis.fit(X_train_extreme, y_train_extreme)
vis.score(X_test_extreme, y_test_extreme)
vis.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Prediction Error Report
vis = ClassPredictionError(cls)

# Fit the training data to the visualizer
vis.fit(X_train_extreme, y_train_extreme)

# Evaluate the model on the test data
vis.score(X_test_extreme, y_test_extreme)

# Draw visualization
vis.poof()
Best parameters set found on development set: {'n_estimators': 2000, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 90, 'bootstrap': True}
In [64]:
#feature importance of all quartile classification model

clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=80, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)

clf.fit(X_extreme, y_extreme)

feats = {} # a dict to hold feature_name: feature_importance
for feature, importance in zip(dfExtremes[featCols].columns, clf.feature_importances_):
    feats[feature] = importance #add the name/value pair 

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='barh', legend = False,
                                                  title = 'Gini Score')
plt.tight_layout()

RF Model 3

In [65]:
# This includes dfDropped[featCols] + clustering labels
X_Cluster = dfCluster

# Just look at the Q1 and Q4 data
y_Cluster = dfExtremes['ACT_Score_Quartiles']

X_train_Cluster, X_test_Cluster, y_train_Cluster, y_test_Cluster = train_test_split(X_Cluster, y_Cluster, random_state=42, test_size=.2)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rfc = RandomForestClassifier(random_state = 24)

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rfc_randomCV3 = RandomizedSearchCV(estimator = rfc, 
                                  param_distributions = random_grid, 
                                  n_iter = 100, cv = cv, verbose = 2, 
                                  random_state = 18, n_jobs = -1, 
                                  scoring = 'accuracy')

# Fit the random search model
rfc_randomCV3.fit(X_train_Cluster, y_train_Cluster)

# examine the best model
print(rfc_randomCV3.best_score_)
print(rfc_randomCV3.best_params_)
print(rfc_randomCV3.best_estimator_)
Fitting 10 folds for each of 100 candidates, totalling 1000 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done 301 tasks      | elapsed:   26.7s
[Parallel(n_jobs=-1)]: Done 584 tasks      | elapsed:   51.2s
0.9947368421052631
{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': None, 'bootstrap': True}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  1.5min finished
In [66]:
# Create the parameter grid based on the results of random search
# based on the cluster extreme model
param_grid = {
    'bootstrap': [False],
    'max_depth': [135, 140, 145],
    'max_features': ['auto'],
    'min_samples_leaf': [1],
    'min_samples_split': [2, 3],
    'n_estimators': [725, 750, 775, 800, 825, 850, 875]
}

# Create a based model
rfc = RandomForestClassifier(random_state = 24)

# Instantiate the grid search model
rfc_gridCV3 = GridSearchCV(estimator = rfc, param_grid = param_grid, 
                                cv = cv, n_jobs = -1, verbose = 2, 
                                scoring = 'accuracy')

# Fit the grid search to the data
rfc_gridCV3.fit(X_train_Cluster, y_train_Cluster)

# examine the best model
print(rfc_gridCV3.best_score_)
print(rfc_gridCV3.best_params_)
print(rfc_gridCV3.best_estimator_)
Fitting 10 folds for each of 42 candidates, totalling 420 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:    6.4s
[Parallel(n_jobs=-1)]: Done 301 tasks      | elapsed:   20.4s
[Parallel(n_jobs=-1)]: Done 420 out of 420 | elapsed:   28.0s finished
0.9921052631578947
{'bootstrap': False, 'max_depth': 135, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 725}
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=135, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=725, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)
In [67]:
# get best parameters from the gridsearch for the clustered model of Q1 vs Q4
bestValues = rfc_gridCV3.best_params_

print("Best parameters set found on development set: {}".format(bestValues))

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Set model to best values found
cls = RandomForestClassifier(bootstrap = bestValues['bootstrap'], 
                             max_depth = bestValues['max_depth'], 
                             max_features = bestValues['max_features'], 
                             min_samples_leaf = bestValues['min_samples_leaf'], 
                             min_samples_split = bestValues['min_samples_split'], 
                             n_estimators = bestValues['n_estimators'])

classFit = cls.fit(X_train_Cluster, y_train_Cluster)

y_hat = cls.predict(X_test_Cluster)

# Train
cm = ConfusionMatrix(classFit)

# Predict test values
cm.predict(X_test_Cluster)

cm.score(X_test_Cluster, y_test_Cluster)

cm.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Classification Report
vis = ClassificationReport(cls)
vis.fit(X_train_Cluster, y_train_Cluster)
vis.score(X_test_Cluster, y_test_Cluster)
vis.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Prediction Error Report
vis = ClassPredictionError(cls)

# Fit the training data to the visualizer
vis.fit(X_train_Cluster, y_train_Cluster)

# Evaluate the model on the test data
vis.score(X_test_Cluster, y_test_Cluster)

# Draw visualization
vis.poof()
Best parameters set found on development set: {'bootstrap': False, 'max_depth': 135, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 725}
In [68]:
#feature analysis of clustered RF model

clf = RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=135, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)

clf.fit(X_Cluster, y_Cluster)

feats = {} # a dict to hold feature_name: feature_importance
for feature, importance in zip(dfExtremes[featCols].columns, clf.feature_importances_):
    feats[feature] = importance #add the name/value pair 

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='barh', legend = False,
                                                  title = 'Gini Score')
plt.tight_layout()

Back to Top

VISUALIZE RESULTS

In [61]:
model = MiniBatchKMeans(n_clusters = 2, random_state = 42)
model.fit(dfExtremes[featCols])
cLabels = model.labels_

dfCluster = dfExtremes[featCols].join(pd.get_dummies(cLabels))

X_cluster = dfCluster

y_cluster = dfExtremes['ACT_Score_Quartiles']

dfCluster.drop(dfCluster.columns[[-1,-2,-3,-4]], axis = 1, inplace = True)
In [71]:
# Credit: https://stackoverflow.com/questions/26558816/matplotlib-scatter-plot-with-legend/26559256
# Credit: https://stackoverflow.com/questions/16834861/create-own-colormap-using-matplotlib-and-plot-color-scale
import matplotlib.patches as mpatches
import matplotlib.colors

dataRows = np.arange(0, dfCluster.shape[0])

classes = ['Bottom Q', 'Top Q']
class_colors = ['r', 'b']
recs = []
for i in range(0, len(class_colors)):
    recs.append(mpatches.Rectangle((0,0),1,1,fc=class_colors[i]))

cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["r","b"])

for col in dfCluster:
    plt.figure()
    plt.scatter(dataRows, dfCluster[col], c=cLabels, cmap=cmap)
    plt.title("Feature Clustering: {}".format(col))
    plt.xlabel("School Instance")
    plt.ylabel(col)
    plt.legend(recs, classes)
    
    plt.tight_layout()

All Quartiles

In [62]:
#let's visualize how the clustering looks on one of the same correlated
#variables we looked at in our initial explaratory data analysis (spg score)

# training data
X_Dropped = dfDropped[featCols]

# target variable
y_Dropped = dfDropped['ACT_Score_Quartiles']

#all quartile best models first
kmeans_AllQ = KMeans(n_clusters=2, init='k-means++',random_state=42)
kmeans_AllQ.fit(X_Dropped)
kmeans_AllQ_labels = kmeans_AllQ.labels_ # the labels from kmeans clustering

minibkmeans_AllQ = MiniBatchKMeans(n_clusters=9, init='k-means++',random_state=42)
minibkmeans_AllQ.fit(X_Dropped)
minibkmeans_AllQ_labels = minibkmeans_AllQ.labels_ # the labels from kmeans clustering

dbs_AllQ =  DBSCAN(eps=24, min_samples = 10)
dbs_AllQ.fit(X_Dropped)
dbs_AllQ_labels = dbs_AllQ.labels_

aggc_AllQ = AgglomerativeClustering(n_clusters=9, linkage='ward')
aggc_AllQ.fit(X_Dropped)
aggc_AllQ_labels = aggc_AllQ.labels_


fig = plt.figure(figsize=(16,20))
title = ['DBSCAN, eps=24, minsamples=10',
         'HAC, clusters=9, linkage=ward',
         'KMEANS, clusters=2, init=kmeans++',
         'MiniBatch, clusters=9, init=kmeans++']

for i,l in enumerate([dbs_AllQ_labels, aggc_AllQ_labels, kmeans_AllQ_labels, minibkmeans_AllQ_labels]):
    
    plt.subplot(4,2,2*i+1)
    plt.scatter(x=dfDropped['SPG_Score'], y=dfDropped['ACT_Score'], c=l, 
                cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.xlabel('SPG_Score'), plt.ylabel('ACT_Score')
    plt.grid()
    plt.title(title[i])

plt.tight_layout()
plt.show()

The charts above visualize how each algorithm clustered based on what RF model yielded the most accurate model from their clustering labels. First, we are looking at the models that had to classify each of the four quartiles of average ACT scores at a given school. Then, we are looking at one of our original featured variables SPG score that was highly correlated with ACT score make the visualization easier.

For DBSCAN with eps = 24 and min_samples = 10, we get only 2 clusters that yielded a RF model with accuracy score 0.659. They are dispersed evenly throughout the graph; clearly, this variable is not what was used to separate the clusters due to their density overlapping with one another.

Agglomerative Clustering (aka Hierarchical Agglomerative Clustering), cluster = 9 and ward linkage, yielded a RF model with 0.655 accuracy. This graph looks somewhat like our original color coded quartile scatter plot with a few extra labels clustered throughout.

K-Means, cluster = 2 and initialization of KMeans++, yielded a fairly accurate RF model (accuracy 0.682). Unlike DBSCAN, this algorithm clustered the data approximately in half on the ACT scores with some overlap.

Mini-Batch K-Means, clusters = 9 and initialization of KMeans++, yielded the most accurate RF model (0.696). It looks like data was approximately clustered into 6 main clusters over the ACT scores, with the top quartile being clustered in many ways.

Top and Bottom Quartile Schools

In [63]:
#all Q1 vs Q4 clustered classification models now
dfExtremes = dfDropped[(dfDropped['ACT_Score_Quartiles'] == 0) | (dfDropped['ACT_Score_Quartiles'] == 3)]

X_Extremes = dfExtremes[featCols]

y_Extremes = dfExtremes['ACT_Score_Quartiles']

kmeans_Q1Q4 = KMeans(n_clusters=2, init='k-means++',random_state=42)
kmeans_Q1Q4.fit(X_Extremes)
kmeans_Q1Q4_labels = kmeans_Q1Q4.labels_ # the labels from kmeans clustering

minibkmeans_Q1Q4 = MiniBatchKMeans(n_clusters=6, init='k-means++',random_state=42)
minibkmeans_Q1Q4.fit(X_Extremes)
minibkmeans_Q1Q4_labels = minibkmeans_Q1Q4.labels_ # the labels from kmeans clustering

dbs_Q1Q4 =  DBSCAN(eps=24, min_samples = 10)
dbs_Q1Q4.fit(X_Extremes)
dbs_Q1Q4_labels = dbs_Q1Q4.labels_

aggc_Q1Q4 = AgglomerativeClustering(n_clusters=2, linkage='ward')
aggc_Q1Q4.fit(X_Extremes)
aggc_Q1Q4_labels = aggc_Q1Q4.labels_


fig = plt.figure(figsize=(16,20))
title = ['DBSCAN, eps=24, minsamples=10',
         'HAC, clusters=2, linkage=ward',
         'KMEANS, clusters=2, init=kmeans++',
         'MiniBatch, clusters=6, init=kmeans++']

for i,l in enumerate([dbs_Q1Q4_labels, aggc_Q1Q4_labels, kmeans_Q1Q4_labels, minibkmeans_Q1Q4_labels]):
    
    plt.subplot(4,2,2*i+1)
    plt.scatter(x=dfExtremes['SPG_Score'], y=dfExtremes['ACT_Score'], c=l, 
                cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.xlabel('SPG_Score'), plt.ylabel('ACT_Score')
    plt.grid()
    plt.title(title[i])

plt.tight_layout()
plt.show()

Next, we visualized how each algorithm clustered, based on what RF model yielded the most accurate results from only a bottom quartile and top quartile classification model for the average ACT score at a given school.

For DBSCAN with eps = 24 and min_samples = 10, we get 5 clusters that yielded a solidly accurate RF model (accuracy score 0.983). Two clusters compose the bottom quartile, and the remaining three are in the top quartile. This suggests that the top quartile schools have many more distinct relationships with other schools in the top quartile than do the bottom quartile schools.

Agglomerative Clustering (HAC), clusters = 2 with linkage = ward, yielded another accurate RF model (accuracy 0.987). Interestingly, HAC clustered all of the bottom quartile schools into one cluster. While that cluster also has some top quartile schools, the inverse for this is not true.

KMeans, clusters = 2 and initialization = KMeans++, yielded a RF model with accuracy 0.985. This clustering is nearly identical to the HAC but two schools in the majority top quartile cluster were labeled in the bottom quartile.

Mini-Batch K-Means, clusters = 6 and initialization of KMeans++, yielded a similarly accurate RF model (accuracy 0.987) to HAC. This is the most dispersed clusterings compared to the other three algorithms. Three clusters can be found in the bottom quartile and three in the top quartile as well. Two clusters in the bottom quartile have some schools labeled in the top quartile as well, but there are few.

Back to Top

SUMMARIZE RAMIFICATIONS

Previous analysis has shown us that by far the most important aspect of raising the percentage of students who enroll in college that come from a given high school in North Carolina is the school's average ACT score. Given this information we wanted to perform analysis on what plays the biggest role in a school's average ACT score. To do this we utilized clustering algorithms to further reduce the dimensionality of the data then a Random Forest classification algorithm that would yield a very accurate model for predicting the 0-25th percentile (first quartile) and the 75-100th percentile (fourth quartile) school's ACT scores. More importantly though it's output gives the most important inputs to the model and how much they are weighted in the final output. Therefore, an importance level can be deduced for attributes at a school that play a role in raising (or lowering) a school's average ACT score.

The most important factor (by nearly two fold) in a school's average ACT score is the school's performance grade score (SPG_Score). No information is given on how this school performance grade is generated so further research would need to be done if this bias's the model unfairly thereby effectively hiding other truly important factors. Regardless, if a school's performance grade is publicly available parents of students can use this information to help choose a public high school for their children.

The second most important factor is the average score by school for all students taking end of course subject tests based on college and career ready standards (EOC_Subject_CACR_All). This factor is likely even more important if we simplfied its weight into one variable by elminating the average score for white students in all subjects (EOCSubjects_CACR_White), male students in english 2 (EOCEnglish2_CACR_Male), and the average score for disabled students in biology (EOCBiology_CACR_SWD) which are all factors that help predict the average ACT score. These tests correlate well with ACT scores. A possibly explanation for this might include students learning addtional topics covered on the ACT or were exposed to material multiple times before seeing it again on the ACT. Regardless, it is clear that access to more advanced subjects will on average contribute to higher ACT scores.

The third most important factor is the average score by school for North Carolina Math 1 exam (NC_Math_1_Score). Similar statements about the end of course exams likely apply to the NC Math 1 exam as well.

Not including the other end of course exams listed previously, the next most important factor is the percentage of economically disadvantaged students (pct_eds). Our initial exploratory analysis showed that increasing the percentage of economically disadvantaged students has a negative correlation with ACT score. This shows that schools that are within wealthier areas perform better as a school on the ACT. This is likely due to the ability for these schools to obtain better staff and have more resources than schools who are in more economically disadvantaged areas and thus have more students who are economically disadvantaged. This is just speculation however and further research would be needed to confirm this.

The next most important factor is the education value-added assessment system growth score (EVAAS_Growth_Score). This is a system from SAS that models the growth based on common assessments schools administer for teachers and school administrators. Therefore, this is a proprietary metric across the state on how much better a school is becoming as a staff. This metric could then be used to show schools as they improve as a whole and thus correlates to and has weight in predicting how well a school's average ACT score is.

Another very weak factor is the number of short term suspensions per 100 students at the state level (st_short_susp_per_c_num). Since this is at a state level this value should be the same for every school in the data set therefore if we re-did this analysis it would be left out entirely.

A negligible factor is the number of National Board Certified Staff at the school level (nbpts_num). This would seem counter-intuitive, however this value is not normalized to how big the school is. Therefore, a school could have a very large number of teachers but few board certified staff while a smaller school could have all board certified staff and this would be weighed the same as input to the algorithm. If we re-did this analysis we would likely create a new feature to normalize this by the student population at the school. However now this does not weigh much into the model at all.

With these weights explained, the ramifications of our model are that schools can utilize these metrics to focus on improving their student's average ACT score. Improving their ACT score correlates strongly with their school's performance grade as well. From a student's parents point of view, this gives them insight into other metrics that could help a decision on which school to attend. The negative consequences is that this could cause parents to not want to attend certain schools. This would then drive down government payments to schools with lower attendances and cause them to become worse and potentially fail.

Back to Top

Deployment

The team originally set out to understand what might be influencing a “leaky pipeline” in the state of North Carolina as it relates to students progressing from high school into post-secondary education. While this analysis just begins to scratch the surface of understanding the various factors that both positively and negatively whether or not students progress to college, it is a solid start in understanding what might be driving forces for change.

Our goals are to identify features of schools that correlate with schools that perform well on the ACT on average and those schools that perform poorly on the ACT on average. This information will allow schools to focus on specific areas to ultimately improve their overall performance.

Furthermore, previous work has shown that schools with high ACT scores have a higher percentage of students who enroll in post-secondary education.

In our models, we chose to predict a school’s ACT score quartile based on 14 features. We believe a school’s performance on the ACT is a great indicator because all students in the NC public school system are administered the exam statewide. We achieved our goals by being able to correctly identify schools that perform well on the ACT on average and schools that perform poorly on the ACT on average with a 100% success rate.

A next phase of analysis would be to use additional historical data - for the 2016, 2015, and 2014 school years - to determine if the trends from our analysis are consistent in those years as well.

These models could be useful to both the State of North Carolina and the Belk Endowment, as they partner in devising a plan to increase post-secondary enrollment at the school level. This analysis, of course, is only one piece of the puzzle as it looks at a very limited set of factors. Additional work could be done to segment variables to correspond to the factors that the state or the endowment can influence, such as investment in teacher education and credentials or improvement in test preparation or school performance or targeted learning programs for different socioeconomic, demographic, or location-based factors. The performance of these models would be measured over time by evaluating individual school performance after investments in factors that the models indicate could be influenced.

Other interested parties could be public school systems in other states that administer the ACT to their students. There are 12 states, including North Carolina, in the US that require all students to take the ACT (source: https://blog.prepscholar.com/which-states-require-the-act-full-list-and-advice). These states may be interested in reviewing the features that could influence ACT performance for their own schools.

Assuming other states have pre-processed machine learning data available, we would deploy our model for interested parties by running recursive feature elimination to isolate important features, split the data using k-fold cross validation, run training models, and then run the test model when it’s ready to assess our performance. We would then share our results with the school district to identify areas of improvement for schools that under perform on the ACT on average.

The data currently being collected by the state is incredibly comprehensive and offers much insight, so there is no immediate recommendation for additional data to collect. The yearly nature of the data collection process makes this a long-term project, as there is not new data being added to the analysis at frequent intervals, and thus the models would be updated annually.

Back to Top

Exceptional Work (10 points total)

  • Hyperparameter tuning using GridSearchCV
  • Use of Yellowbrick graphs to refine the range of hyperparameters
  • New feature creation in ACT Score Quartiles for a response variable to classify on
  • Use custom metric to analyze clusters

To improve our models, beyond arbitrarily changing parameters and re-running models to attempt to lower the error rate, we used the package Yellowbrick to guide us visually to an optimal range of parameter values. For each hyperparameter we planned to tune, we set a range to visually check the Accuracy score for each value. The resulting smaller range of values can now be used to hypertune in RandomizedGridSearch and then GridSearchCV. GridSearchCV accepts ranges of values for parameters you plan to hypertune and make a model for each. It can then display the set of parameters that yielded the best score.

Since the Random Forest algorithm is much more computationally intensive, it is not feasible to feed a wide range of parameter values into GridSearchCV. Because of this, we initially used RandomizedGridSearchCV, which does not use the entire data to test the models like GridSerachCV. Instead, it samples a small portion of the data. We used a larger range of parameter values for RandomizedGridSearchCV to start and then refined that range based on the results to then be used in GridSearchCV. The hypertuned parameters we used for our Random Forest classification are:

  • n_estimators - This parameter adjusts how many decision trees are in the forest. This will always mean the more trees, the better the model. But as this increases, the trade-off is time, as the computations needed increases drastically.
  • max_depth - This parameter adjusts how deep the decision trees are allowed to have. If the * max depth is set to none then the nodes will expand until all the leaves are pure or when all the leaves have less than the min_samples_split samples.
  • min_samples_split - This parameter sets the minimum number of samples required to split a node.
  • min_samples_leaf - This parameter sets the minimum number of samples neeed to make a leaf node on both the left and right node of the split. This might affect model smoothing in regression.

In order to classify a continuous variable we broke the average ACT score at a school into quartiles. This allowed us to look at how well a Random Forest classification could accuractely predict a school's ACT score into any of the four quartiles. We also helped our accuracy drastically when we classified into only the bottom (Q1) and top (Q4) quartiles. Lastly, in order to compare four different clustering algorithms we used a custom metric. Rather than only using Silhouette score, we utilized a pipeline loop to find the best accuracy from a default Random Forest Classification algorithm. This quickly helped us analyze which cluster parameters would help provide us with a more accurate model rather than needing to rely on other less reliable metrics associated with clustering algorithms.

In [ ]: